Abstract


In sequential decision problems in an unknown environment, the decision maker often faces a 
dilemma over whether to explore to discover more about the environment, or to exploit current 
knowledge. We address the exploration-exploitation dilemma in a general setting encompassing 
both standard and contextualised bandit problems. The contextual bandit problem has recently 
resurfaced in attempts to maximise click-through rates in web based applications, a task with sig- 
nificant commercial interest. 

In this article we consider an approach of Thompson (1933) which makes use of samples from 
the posterior distributions for the instantaneous value of each action. We extend the approach by 
introducing a new algorithm, Optimistic Bayesian Sampling (OBS), in which the probability of 
playing an action increases with the uncertainty in the estimate of the action value. This results in 
better directed exploratory behaviour. 

We prove that, under unrestrictive assumptions, both approaches result in optimal behaviour 
with respect to the average reward criterion of Yang and Zhu (2002). We implement OBS and 
measure its performance in simulated Bernoulli bandit and linear regression domains, and also 
when tested with the task of personalised news article recommendation on a Yahoo! Front Page 
Today Module data set. We find that OBS performs competitively when compared to recently 
proposed benchmark algorithms and outperforms Thompson's method throughout. 

Keywords: multi-armed bandits, contextual bandits, exploration-exploitation, sequential alloca- 
tion, Thompson sampling 


1. Introduction 


In sequential decision problems in an unknown environment, the decision maker often faces a 
dilemma over whether to explore to discover more about the environment, or to exploit current 
knowledge. We address this exploration-exploitation dilemma in a general setting encompassing
both standard bandit problems (Gittins, 1979; Sutton and Barto, 1998; Auer et al., 2002) and
contextual-bandit problems (Graepel et al., 2010; Li et al., 2010; Auer, 2002; Yang and Zhu, 2002). 
This dilemma has traditionally been solved using either ad hoc approaches like ε-greedy or softmax
action selection (Sutton and Barto, 1998, Chapter 2) or computationally demanding lookahead ap- 
proaches such as Gittins indices (Gittins, 1979) which provably satisfy an optimality criterion with 
respect to cumulative discounted reward. However, the lookahead approaches become intractable 
in all but the simplest settings and the ad hoc approaches are generally perceived to over-explore, 
despite providing provably optimal long term average reward. 

In recent years, Upper Confidence Bound (UCB) methods have become popular (Lai and Rob- 
bins, 1985; Kaelbling, 1994; Agrawal, 1995; Auer et al., 2002), due to their low computational cost, 
ease of implementation and provable optimality with respect to the rate of regret accumulation. 

In this article we consider an approach of Thompson (1933) which uses posterior distributions 
for the instantaneous value of each action to determine a probability distribution over the available 
actions. Thompson considered only Bernoulli bandits, but in general the approach is to sample a 
value from the posterior distribution of the expected reward of each action, then select the action 
with the highest sample from the posterior. Since in our generalised bandit setting the samples 
are conditioned on the regressor, we label this technique as Local Thompson Sampling (LTS). The 
technique is used by Microsoft in selecting adverts to display during web searches (Graepel et al., 
2010), although no theoretical analysis of Thompson sampling in contextual bandit problems has 
been carried out. 

When these posterior samples are represented as a sum of exploitative value and exploratory 
value, it becomes clear that LTS results in potentially negative exploratory values. This motivates a
new algorithm, Optimistic Bayesian Sampling (OBS), which is based on the LTS algorithm but
replaces any negative exploratory value with zero.

We prove that, under unrestrictive assumptions, both approaches result in optimal behaviour in 
the long term consistency sense described by Yang and Zhu (2002). These proofs use elementary 
and coupling techniques. 

We also implement LTS and OBS and measure their performance in simulated Bernoulli bandit 
and linear regression domains, and also when tested with the task of personalised news article 
recommendation on the Yahoo! Front Page Today Module User Click Log Data Set (Yahoo!
Academic Relations, 2011). We find that LTS displays competitive performance, a view shared by 
Chapelle and Li (2011), and also that OBS outperforms LTS throughout. 


1.1 Problem Formulation 


An agent is faced with a contextual bandit problem as considered by Yang and Zhu (2002). The 
process runs for an infinite sequence of time steps, $t \in \mathcal{T} = \{1, 2, \ldots\}$. At each time step, $t$, a
regressor, $x_t \in \mathcal{X}$, is observed. An action choice, $a_t \in \mathcal{A}$, $\mathcal{A} = \{1, \ldots, A\}$, $A < \infty$, is made and a
reward $r_t \in \mathbb{R}$ is received.

The contextual bandit framework considered assumes that reward can be expressed as 


$$r_t = f_{a_t}(x_t) + z_{t,a_t},$$


where the $z_{t,a}$ are zero mean random variables with unknown distributions and $f_a : \mathcal{X} \to \mathbb{R}$ is an
unknown continuous function of the regressor specific to action $a$. The stream of regressors $x_t$ is
assumed not to be influenced by the actions or the rewards, and for simplicity we assume that these
are drawn independently from some fixed distribution on $\mathcal{X}$.¹ For our actions to be comparable,
we assume that $\forall a \in \mathcal{A}$, $\forall t \in \mathcal{T}$, $\forall x \in \mathcal{X}$, $f_a(x) + z_{t,a}$ is supported on the same set, $\mathcal{S}$.
Furthermore, to avoid boundary cases we assume that $\forall a \in \mathcal{A}$

$$\sup_{x \in \mathcal{X}} f_a(x) < \sup \mathcal{S}. \qquad (1)$$

In situations where the $z_{t,a}$ have unbounded support, $\mathcal{S} = \mathbb{R}$, and (1) is vacuous if $\mathcal{X}$ is compact. The
condition is meaningful in situations where $\mathcal{S}$ is compact, such as if rewards are in $\{0, 1\}$.


Definition 1 The optimal expected reward function, $f^* : \mathcal{X} \to \mathbb{R}$, is defined by

$$f^*(x) = \max_{a \in \mathcal{A}} f_a(x).$$

A minimal requirement for any sensible bandit algorithm is the average reward convergence
criterion of Yang and Zhu (2002), which identifies whether a sequence of actions receives,
asymptotically, rewards that achieve this optimal expected reward. Hence the main theoretical aim in
this article is to prove under mild assumptions that LTS and OBS construct a sequence of actions such
that

$$\frac{\sum_{s=1}^{t} f_{a_s}(x_s)}{\sum_{s=1}^{t} f^*(x_s)} \xrightarrow{a.s.} 1 \quad \text{as } t \to \infty. \qquad (2)$$
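As an illustration only (it is not part of the original analysis), the ratio appearing in criterion (2) can be computed for a finite sequence when the reward functions are known, as they would be in a simulation. The sketch below assumes positive expected rewards so that the denominator is non-zero; the function name `reward_ratio` and the two linear reward functions are hypothetical.

```python
from typing import Callable, Sequence


def reward_ratio(
    reward_fns: Sequence[Callable[[float], float]],
    regressors: Sequence[float],
    actions: Sequence[int],
) -> float:
    """Finite-t ratio from criterion (2): sum_s f_{a_s}(x_s) / sum_s f*(x_s)."""
    # Expected reward actually collected by the chosen actions.
    achieved = sum(reward_fns[a](x) for a, x in zip(actions, regressors))
    # Expected reward of the best action at each regressor, f*(x_s).
    optimal = sum(max(f(x) for f in reward_fns) for x in regressors)
    return achieved / optimal


# Hypothetical two-action problem on X = [0, 1] (assumed for illustration).
fns = [lambda x: 0.2 + 0.5 * x, lambda x: 0.6 - 0.3 * x]
xs = [0.1, 0.9, 0.7, 0.3]

always_first = reward_ratio(fns, xs, [0, 0, 0, 0])  # ignores the regressor
oracle = reward_ratio(fns, xs, [1, 0, 0, 1])        # best action at every step
print(always_first, oracle)
```

A policy that always plays the first action obtains a ratio strictly below one on this sequence, whereas the oracle sequence attains a ratio of one; criterion (2) asks that the ratio of the algorithm's action sequence tends to one.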

The choice of action $a_t$ is based on the current and past regressors, $\{x_1, \ldots, x_t\}$, past action
choices, $\{a_1, \ldots, a_{t-1}\}$, and past rewards, $\{r_1, \ldots, r_{t-1}\}$. Denote $\tilde{I}_1 = \emptyset$ and, for all times $\{t \in \mathcal{T} :
t \geq 2\}$, denote

$$\tilde{I}_t = (x_1, \ldots, x_{t-1}, a_1, \ldots, a_{t-1}, r_1, \ldots, r_{t-1}).$$

Furthermore denote all of the prior information available as $I_0$ and also all the information available
at time $t$ as $I_t$ $(= I_0 \cup \tilde{I}_t)$.


Definition 2 The policy, $(\pi_t(\cdot))_{t \in \mathcal{T}}$, is a sequence of conditional probability mass functions where
$\pi_t(a) = P(a_t = a \mid I_t, x_t)$. At each time step $t$, the policy maps $I_t$ and $x_t$ to a probability mass function
giving the probability of each action being selected.


The policy is constructed in advance of the process, using only $I_0$, and is the function used to map
$I_t$ and $x_t$ to action selection probabilities for each of the actions.

Note also that, under a Bayesian approach, the information sets $I_t$ result in posterior distributions
for quantities of potential interest. In particular $I_0$ defines the assumed functional forms of the $f_a$,
and a prior distribution over the assumed space of functions, which is then updated as information
is received, resulting in a Bayesian regression procedure for estimating the reward functions $f_a$, and
hence a posterior distribution and expectation of $f_a(x_t)$ conditional on the information set $I_t \cup \{x_t\}$.

We do not however formulate an exact probability model of how regressors are sampled, rewards
are drawn and inference is carried out. Instead we rely on Assumptions 1–5 placed on the Bayesian
regression framework, given in Section 3, that will be satisfied by standard models for the $x_t$, $r_t$ and
prior information $I_0$. In particular, randomness resulting from the regressor and reward sequences
is controlled through these assumptions, whereas our proofs control the randomness due to the
action selection method. A useful framework to keep in mind is one in which regressors are drawn
independently from a distribution on a compact Euclidean space $\mathcal{X}$, each $z_{t,a}$ is a Gaussian random
variable independent of all other random variables, and the prior information $I_0$ includes that each
$f_a$ is a linear function, and a prior distribution over the parameters of these functions; we revisit this
model in Section 4.2 to demonstrate how this framework does indeed ensure that all the Assumptions
are satisfied. However much more general frameworks will also result in our Assumptions being
satisfied, and restricting to a particular probability model at this point would unnecessarily restrict the
analysis.

1. Note that this assumption of iid sampling from $\mathcal{X}$ is only used in the latter part of the proof of Theorem 1. In
fact an ergodicity condition on the convergence of sample averages would suffice, but would increase the notational
complexity of the proofs.
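For readers who prefer a concrete instance, the linear-Gaussian framework mentioned above can be sketched as a small simulator. This is a minimal illustration under stated assumptions, not the experimental set-up of the paper: the class name `LinearGaussianBandit`, the one-dimensional regressor space X = [0, 1], the uniform regressor distribution and the parameter values are all hypothetical choices.

```python
import random
from typing import List, Tuple


class LinearGaussianBandit:
    """Contextual bandit with f_a(x) = b0_a + b1_a * x and Gaussian noise z_{t,a}."""

    def __init__(self, thetas: List[Tuple[float, float]], noise_sd: float, seed: int = 0):
        self.thetas = thetas          # (intercept, slope) per action; unknown to the agent
        self.noise_sd = noise_sd      # standard deviation of the zero-mean noise
        self.rng = random.Random(seed)

    def draw_regressor(self) -> float:
        # Regressors are iid on the compact set X = [0, 1] (assumed uniform here)
        # and do not depend on past actions or rewards.
        return self.rng.uniform(0.0, 1.0)

    def expected_reward(self, a: int, x: float) -> float:
        b0, b1 = self.thetas[a]
        return b0 + b1 * x

    def reward(self, a: int, x: float) -> float:
        # r_t = f_{a_t}(x_t) + z_{t,a_t}
        return self.expected_reward(a, x) + self.rng.gauss(0.0, self.noise_sd)


env = LinearGaussianBandit(thetas=[(0.0, 1.0), (0.5, 0.0)], noise_sd=0.1)
x = env.draw_regressor()
r = env.reward(0, x)
```

In this hypothetical instance the optimal action depends on the regressor: action 0 is better when x exceeds 0.5 and action 1 is better otherwise, which is the feature that distinguishes the contextual problem from the standard bandit.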


1.2 Algorithm Motivation 


The choice of algorithm presented in this article is motivated by both infinite and finite time consid- 
erations. The first subsection of this section describes desirable infinite time properties for an algo- 
rithm that are of importance in proving optimality condition (2). The second subsection describes, 
in a heuristic manner, desirable finite time properties to help understanding of the motivation behind 
our choice of algorithm, as opposed to the many other algorithms that also satisfy the infinite time 
requirements. 


1.2.1 INFINITE TIME CONSIDERATIONS 


In conventional interpretations of similar problems (Littman, 1996; Singh et al., 2000; Sutton and 
Barto, 1998), there are two major aspects of generating a policy. The first is developing an evaluation 
scheme and the second an action selection scheme. 

So that the agent can evaluate actions, a regression procedure is used to map the current regressor
and the history $I_t$ to value estimates for the actions. Denote the agent's estimated value of action
$a$ at time $t$ when regressor $x$ is presented as $\hat{f}_{t,a}(x)$. Since $\hat{f}_{t,a}$ is intended to be an estimate of $f_a$,
it is desirable that the evaluation procedure is consistent, that is, $\forall a \in \mathcal{A}$, $\forall x \in \mathcal{X}$, $\hat{f}_{t,a}(x) - f_a(x)$
converges in some sense to 0 as $n_{t,a} \to \infty$, where $n_{t,a}$ is the number of times action $a$ has been
selected up to time $t$. Clearly such convergence will depend on the sequence of regressor values
presented. However consistency of evaluation is not the focus of this work, so will be assumed
where necessary and the evaluation procedure used for all algorithms compared in the numerical
experiments in §4 will be the same. The main focus of this work is on the action selection side of
the problem.

Once action value estimates are available, the agent must use an action selection scheme to 
decide which action to play. So that the consistency of estimation is achieved, it is necessary that 
the action selection ensures that every action is selected infinitely often. In this work, we consider 
algorithms generating randomised policies as a way of ensuring infinite exploration is achieved. 

In addition to consistent evaluation and infinite exploration, it is also necessary to exploit the
obtained information. Hence the action selection method should be greedy in the limit, that is, the
policy $\pi_t$ is designed such that

$$\sum_{a \in \arg\max_{a' \in \mathcal{A}} \hat{f}_{t,a'}(x_t)} \pi_t(a) \xrightarrow{a.s.} 1 \quad \text{as } t \to \infty.$$


These considerations lead us to GLIE (greedy in the limit with infinite exploration) policies,
for which action selection is greedy in the limit and also guarantees infinite
exploration (Singh et al., 2000). We combine a GLIE policy with consistent evaluation to achieve
criterion (2).
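To make the two GLIE requirements concrete, the sketch below uses a familiar randomised policy that is not one of the algorithms studied in this article: ε-greedy with a decaying exploration rate ε_t = 1/t. The decay schedule, the function names and the value estimates are assumptions made for illustration. Each action keeps probability at least 1/(tA), which sums to infinity over t (the usual route to infinite exploration), while the probability mass on greedy actions tends to one.

```python
from typing import List


def epsilon_greedy_policy(values: List[float], t: int) -> List[float]:
    """Action probabilities pi_t for epsilon-greedy with epsilon_t = 1 / t."""
    eps = 1.0 / t
    n = len(values)
    best = max(values)
    greedy = [i for i, v in enumerate(values) if v == best]
    probs = [eps / n] * n                       # uniform exploration share
    for i in greedy:
        probs[i] += (1.0 - eps) / len(greedy)   # remaining mass on greedy actions
    return probs


def greedy_mass(values: List[float], probs: List[float]) -> float:
    """Total probability placed on actions with maximal estimated value."""
    best = max(values)
    return sum(p for v, p in zip(values, probs) if v == best)


estimates = [0.3, 0.7, 0.5]  # hypothetical value estimates at the current regressor
for t in (1, 10, 1000):
    pi = epsilon_greedy_policy(estimates, t)
    print(t, greedy_mass(estimates, pi))
```

The printed greedy mass increases towards one as t grows, yet no action ever receives zero probability.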


1.2.2 FINITE TIME CONSIDERATIONS 


As well as convergence criterion (2), our choice of algorithm is also motivated by informal finite 
time considerations, since many algorithms for which (2) holds are perceived to explore more than 
is desirable. We note that formal optimality criteria are available, such as expected cumulative dis- 
counted reward (Gittins, 1979) and rate of regret accumulation (Auer et al., 2002). However an 
analysis of Thompson sampling under these criteria has proved elusive, and our heuristic approach 
inspires a modification of Thompson sampling which compares favourably in numerical experi- 
ments (see Section 4). In this section, we discuss the short term heuristics. 

In particular, consider the methodology of evaluating both an exploitative value estimate and an 
‘exploratory bonus’ at each time step for each action, and then acting greedily based on the sums
of exploitative and exploratory values (Meuleau and Bourgine, 1999). An action's exploitative 
value estimate corresponds to the expected immediate reward (i.e., expected reward for the current 
timestep) from selecting the action, given information obtained so far, and therefore the posterior 
expectation of expected immediate reward is the appropriate exploitative action value estimate. 


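Definition 3 Let $p_a(\cdot \mid I_t, x_t)$ denote the posterior distribution of $f_a(x_t)$ given $I_t$ and $x_t$, and let $Q^{\mathrm{Th}}_{t,a}$
be a random variable with distribution $p_a(\cdot \mid I_t, x_t)$. The exploitative value, $\hat{f}_{t,a}(x_t)$, of action $a$ at
time $t$ is defined by

$$\hat{f}_{t,a}(x_t) = \mathbb{E}\left(Q^{\mathrm{Th}}_{t,a} \mid I_t, x_t\right).$$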


Thompson (1933) suggests selecting action $a_t$ with probability equal to the probability that $a_t$ is
optimal, given $I_t$ (there is no regressor in Thompson’s framework). This principle has recently been
used by Graepel et al. (2010), who implement the scheme by sampling, for each $a$, $Q^{\mathrm{Th}}_{t,a}$ from the
posterior distribution $p_a(\cdot \mid I_t, x_t)$ and selecting an action that maximises $Q^{\mathrm{Th}}_{t,a}$. This corresponds to
using an exploratory value $\tilde{f}^{\mathrm{Th}}_{t,a}(x_t) := Q^{\mathrm{Th}}_{t,a} - \hat{f}_{t,a}(x_t)$ which is sampled from the posterior distribution
of the error in the exploitative action value estimate at the current regressor. We name this scheme
Local Thompson Sampling (LTS), where ‘local’ makes reference to the fact that action selection
probabilities are the probabilities that each action is optimal at the current regressor. Under mild
assumptions on the posterior expectation and error distribution approximations used, one can show
that Local Thompson Sampling guarantees that convergence criterion (2) holds (see Theorem 1).

However the exploratory value $\tilde{f}^{\mathrm{Th}}_{t,a}(x_t)$ under LTS has zero conditional expectation given $I_t$ and
$x_t$ (by Definition 3) and can take negative values. Both of these properties are undesirable if one
assumes that information is useful for the future. One consequence of this is that, in regular situations,
the probability of selecting an action $\hat{a}^*_t \in \arg\max_{a \in \mathcal{A}} \hat{f}_{t,a}(x_t)$ decreases as the posterior variance of
$f_{\hat{a}^*_t}(x_t) - \hat{f}_{t,\hat{a}^*_t}(x_t)$ increases, that is, if the estimate for an action with the highest exploitative value
has a lot of uncertainty then it is less likely to be played than if the estimate had little uncertainty.

To counteract this feature of LTS, we introduce a new procedure, Optimistic Bayesian Sampling
(OBS) in which the exploratory value is given by

$$\tilde{f}_{t,a}(x_t) = \max\left(0, \; Q^{\mathrm{Th}}_{t,a} - \hat{f}_{t,a}(x_t)\right),$$

that is, the LTS exploratory value with its negative part replaced by zero.
This exploratory value has positive conditional expectation given $I_t$ and $x_t$ and cannot take negative
values. The exploratory bonus results in increased selection probabilities for uncertain actions, a
desirable improvement when compared to LTS. In §3, we show that OBS satisfies the convergence
criterion (2) under mild assumptions. Furthermore, simulations described in §4 indicate that the
OBS algorithm does indeed outperform LTS, confirming the intuition above.
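The difference between the two exploratory values can be checked numerically. The sketch below assumes, purely for illustration, that the posterior of the expected reward at the current regressor is Gaussian with known mean and standard deviation; the function name and parameter values are hypothetical. Under this assumption the LTS exploratory value averages to roughly zero and is negative about half of the time, while the OBS value is never negative and has mean close to sd / sqrt(2π).

```python
import math
import random


def exploratory_values(mean: float, sd: float, n: int, seed: int = 0):
    """Sample LTS and OBS exploratory values for a Gaussian posterior N(mean, sd^2)."""
    rng = random.Random(seed)
    lts, obs = [], []
    for _ in range(n):
        q = rng.gauss(mean, sd)            # posterior sample Q^Th
        lts.append(q - mean)               # LTS: sample minus exploitative value
        obs.append(max(0.0, q - mean))     # OBS: negative part replaced by zero
    return lts, obs


lts, obs = exploratory_values(mean=0.4, sd=0.2, n=20000)
lts_mean = sum(lts) / len(lts)
obs_mean = sum(obs) / len(obs)
print(lts_mean, obs_mean, 0.2 / math.sqrt(2.0 * math.pi))
```

Because the OBS bonus scales with the posterior standard deviation in this Gaussian case, a more uncertain estimate receives a larger expected bonus, which matches the motivation given above.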


1.3 Related Work 


There are three broad classes of exploration approach: undirected, myopic and belief-lookahead 
(Asmuth et al., 2009). In undirected exploration, the action selection distribution depends only on 
the values of the exploitative action value estimates. Examples of undirected exploration include 
ε-greedy and softmax action selection (see Chapter 2 of Sutton and Barto, 1998). In general, the
short term performance of undirected methods is restricted by the fact that estimate uncertainty is 
not considered. 

At the other end of the spectrum, in belief-lookahead methods, such as those suggested by 
Gittins (1979), a fully Bayesian approach is incorporated in which the action yielding the highest 
expected cumulative reward over the remainder of the process is selected,² thereby considering
exploitative and exploratory value both directly and simultaneously and providing the optimal decision
rule according to the specific criterion of maximising expected cumulative discounted reward.
According to Wang et al. (2005), “in all but trivial circumstances, there is no hope of exactly following
an optimal action selection strategy”. Furthermore, even when it is possible to evaluate the optimal
decision rule, “the optimal solutions are typically hard to compute, rely on artificial discount factors
and fail to generalise to realistic reward distributions” (Scott, 2010). There is also the issue of
‘incomplete learning’; Brezzi and Lai (2000) showed that, for standard bandit problems, Gittins’ index
rule samples only one action infinitely often and that this action is sub-optimal with positive prob- 
ability. If the modelling assumptions and posterior approximations used are accurate, then this is a 
price worth paying in order to maximise expected cumulative discounted reward. However, if the 
posterior approximation method admits a significant error, then it may be that a too heavy reliance 
is placed on early observations. For these reasons, Gittins-type rules are rarely useful in practice. 

In myopic methods, the uncertainty of action value estimates is taken into account, although the 
impact of action selections on future rewards is not considered directly. The exploratory component 
of myopic methods aims to reduce the uncertainty at the current regressor without explicitly con- 
sidering future reward. By reducing uncertainty at each point presented as a regressor, uncertainty 
is reduced globally ‘in the right places’ without considering the regressor distribution. Myopic ac- 
tion selection can be efficient, easy to implement and computationally cheap. The LTS and OBS 
methods presented in this paper are myopic methods. The other main class of myopic methods 
are the upper confidence bound methods, which are now popular in standard and contextual bandit 
applications, and in some settings can be proved to satisfy an optimality criterion with respect to 
the rate of accumulation of regret (for an overview, and definitions of various notions of regret, see 
Cesa-Bianchi and Lugosi, 2006). 

Inspired by the work of Lai and Robbins (1985) and Agrawal (1995), Auer et al. (2002) proposed
a myopic algorithm, UCB1, for application in standard bandit problems. The exploratory value at
time $t$ for action $a$, which we denote $\tilde{f}_{t,a}$, takes the simple form

$$\tilde{f}_{t,a} = \sqrt{\frac{2 \log(t - 1)}{n_{t,a}}}.$$

2. Note that this is only meaningful in the case of discounted rewards or if the time sequence is finite.

Infinite exploration is guaranteed by the method, since the exploratory value grows in periods
in which the associated action is not selected. Moreover, Auer et al. (2002) prove that the ex- 
pected finite-time regret is logarithmically bounded for bounded reward distributions, matching the 
(asymptotically) optimal rate derived by Lai and Robbins (1985) uniformly over time. Auer et al. 
(2002) also propose a variant of UCB1, named UCB-Tuned, which incorporates estimates of the 
reward variances, and show it to outperform UCB1 in simulations, although no theoretical results 
are given for the variant. 
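As a small illustration, the UCB1 exploratory value, taken here to be sqrt(2 log(t - 1) / n_{t,a}), and the resulting greedy choice can be sketched as follows. The function names are hypothetical, and the sketch assumes that every action has already been played at least once (so that n_{t,a} ≥ 1) and that t ≥ 2.

```python
import math
from typing import List


def ucb1_bonus(t: int, n_ta: int) -> float:
    """Exploratory value sqrt(2 log(t - 1) / n_{t,a}); requires t >= 2 and n_ta >= 1."""
    return math.sqrt(2.0 * math.log(t - 1) / n_ta)


def ucb1_action(t: int, means: List[float], counts: List[int]) -> int:
    """Choose the action maximising sample mean plus bonus (ties go to the lowest index)."""
    totals = [m + ucb1_bonus(t, n) for m, n in zip(means, counts)]
    return max(range(len(totals)), key=totals.__getitem__)


# Hypothetical state after 52 plays: action 0 looks better but action 1 is barely explored.
chosen = ucb1_action(t=53, means=[0.6, 0.5], counts=[50, 2])
print(chosen)
```

With n_{t,a} held fixed the bonus grows with t, which is the mechanism behind the infinite exploration noted above; in the hypothetical state shown, the rarely played action is selected despite its lower sample mean.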

Two recently-proposed variants of the UCB1 algorithm are the MOSS (Minimax Optimal Strategy
in the Stochastic case) algorithm (Audibert and Bubeck, 2010) and the UCB-V algorithm
(Audibert and Bubeck, 2009). The MOSS algorithm is defined for finite problems with known horizon
$|\mathcal{T}|$, but the ‘doubling trick’ described in §2.3 of Cesa-Bianchi and Lugosi (2006) can be used if the
horizon is not known. MOSS differs from UCB1 by replacing the $\log(t - 1)$ term in the exploratory
value with $\log\left(\frac{|\mathcal{T}|}{A \, n_{t,a}}\right)$ and hence selecting intensively drawn actions less often. The UCB-V
algorithm incorporates estimates of reward variance in a similar way to the UCB-Tuned algorithm.
The UCB-Tuned, MOSS and UCB-V algorithms provide suitable benchmarks for comparison in
Bernoulli bandit problems.

Another class of ‘UCB-type’ algorithms was proposed initially by Lai and Robbins (1985),
with a recent theoretical analysis by Garivier and Cappé (2011). The evaluation of action values
involves constrained maximisation of Kullback-Leibler divergences. The primary purpose of the
KL-UCB algorithm is to address the non-parametric problem although parametric implementation is
discussed and optimal asymptotic regret bounds are proven for Bernoulli rewards. In the parametric
case, a total action value corresponds to the highest posterior mean associated with a posterior
distribution that has KL divergence less than a pre-defined term increasing logarithmically with
time. A variant of KL-UCB, named KL-UCB+, is also proposed by Garivier and Cappé (2011) and
is shown to outperform KL-UCB (with respect to expected regret) in simulated Bernoulli reward
problems. Both algorithms also serve as suitable benchmarks for comparison in Bernoulli bandit
problems.

For contextual bandit problems, interval estimation (IE) methods, such as those suggested by
Kaelbling (1994), Pavlidis et al. (2008) and Li et al. (2010) (under the name LinUCB), have become
popular. They are UCB-type methods in which actions are selected greedily based on the upper
bound of a confidence interval for the exploitative value estimate at a fixed significance level. The
exploratory value used in IE methods is the difference between the upper bound and the exploitative
value estimate. The width of the confidence interval at a particular point in the regressor space is
expected to decrease the more times the action is selected.

There are numerous finite-time analyses of the contextual bandit problem. The case of linear
expected reward functions provides the simplest contextual setting and examples of finite-time
analyses include those of the SupLinRel and SupLinUCB algorithms by Auer (2002) and Chu et al.
(2011) respectively, in which high probability regret bounds are established. The case of generalised
linear expected rewards is considered by Filippi et al. (2010), proving high probability regret
bounds for the GLM-UCB algorithm. Slivkins (2011) provides an example of finite-time analysis
of contextual bandits in a more general setting, in which a regret bound is proved for the Contextual
Zooming algorithm under the assumptions that the joint regressor and action space is a compact
metric space and the reward functions are Lipschitz continuous over the aforementioned space.

On the other hand, very little is known about the theoretical properties of Thompson sampling.
The only theoretical studies of Thompson sampling that we are aware of are by Granmo (2008)
and Agrawal and Goyal (2011). The former work considers only the two-armed non-contextual
Bernoulli bandit and proves that Thompson sampling (the Bayesian Learning Automaton, in their 
terminology) converges to only pulling the optimal action with probability one. The latter work con- 
siders the K-armed non-contextual Bernoulli bandit and proves an optimal rate of regret (uniformly 
through time) for Thompson sampling. In this work, we focus on proving convergence criterion (2) 
for the LTS and OBS algorithms in a general contextual bandit setting in §3 and perform numerical
experiments in §4 to illustrate the finite time properties of the algorithms.


2. Algorithms 


In this section, we describe explicitly how the action selection is carried out at each decision instant 
for both the LTS and the OBS algorithms. 

At each time $t$, the LTS algorithm requires a mechanism that can, for each action $a \in \mathcal{A}$, be
used to sample from the posterior distribution of $f_a(x_t)$ given regressor $x_t$ and information set $I_t$.
Recall that the density of this distribution is denoted as $p_a(\cdot \mid I_t, x_t)$ and a random variable from the
distribution as $Q^{\mathrm{Th}}_{t,a}$.


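Algorithm 1 Local Thompson Sampling (LTS)
Input: Posterior distributions $\{p_a(\cdot \mid I_t, x_t) : a \in \mathcal{A}\}$
for $a = 1$ to $A$ do
    Sample $Q^{\mathrm{Th}}_{t,a} \sim p_a(\cdot \mid I_t, x_t)$
end for
Sample $a_t$ uniformly from $\arg\max_{a \in \mathcal{A}} Q^{\mathrm{Th}}_{t,a}$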
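The LTS selection step can be sketched in code for the simplest case, a non-contextual Bernoulli bandit with conjugate Beta posteriors. This is an illustrative sketch under assumptions that are not part of the algorithm statement: the Beta(1, 1) priors, the function names `lts_select` and `run_lts`, and the success probabilities are all hypothetical.

```python
import random
from typing import List, Tuple


def lts_select(posteriors: List[Tuple[float, float]], rng: random.Random) -> int:
    """One LTS step for Beta(alpha, beta) posteriors over Bernoulli success rates."""
    samples = [rng.betavariate(a, b) for a, b in posteriors]  # Q^Th_{t,a}
    best = max(samples)
    ties = [i for i, q in enumerate(samples) if q == best]
    return rng.choice(ties)  # uniform choice over the argmax set


def run_lts(probs: List[float], steps: int, seed: int = 0) -> List[int]:
    """Simulate LTS on a Bernoulli bandit and return the number of plays per action."""
    rng = random.Random(seed)
    posteriors = [(1.0, 1.0) for _ in probs]  # uniform Beta(1, 1) priors (assumed)
    counts = [0] * len(probs)
    for _ in range(steps):
        a = lts_select(posteriors, rng)
        reward = 1 if rng.random() < probs[a] else 0
        alpha, beta = posteriors[a]
        posteriors[a] = (alpha + reward, beta + 1 - reward)  # conjugate Beta update
        counts[a] += 1
    return counts


counts = run_lts([0.2, 0.8], steps=2000)
print(counts)
```

In the contextual setting the same selection step applies, with the Beta draw replaced by a draw from the posterior of the expected reward at the current regressor.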








As in the case of the LTS algorithm, at each time $t$, the OBS algorithm requires a mechanism that
can, for each action $a \in \mathcal{A}$, be used to sample from the posterior distribution of $f_a(x_t)$ given regressor
$x_t$ and information set $I_t$. Additionally, the OBS algorithm requires a mechanism for evaluating
exploitative value $\hat{f}_{t,a}(x_t)$, where exploitative value is taken to be the posterior expectation of $f_a(x_t)$
given $I_t$ and $x_t$.


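Algorithm 2 Optimistic Bayesian Sampling (OBS)
Input: Posterior distributions $\{p_a(\cdot \mid I_t, x_t) : a \in \mathcal{A}\}$
for $a = 1$ to $A$ do
    Sample $Q^{\mathrm{Th}}_{t,a} \sim p_a(\cdot \mid I_t, x_t)$
    Evaluate $\hat{f}_{t,a}(x_t) = \mathbb{E}(Q^{\mathrm{Th}}_{t,a} \mid I_t, x_t)$
    Set $Q_{t,a} = \max(Q^{\mathrm{Th}}_{t,a}, \hat{f}_{t,a}(x_t))$
end for
Sample $a_t$ uniformly from $\arg\max_{a \in \mathcal{A}} Q_{t,a}$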
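The OBS selection step differs from LTS only in that each posterior sample is floored at the posterior mean. A minimal sketch for Bernoulli rewards with Beta posteriors is given below; the Beta model and the function names `obs_values` and `obs_select` are assumptions made for illustration, not the implementation used in the experiments.

```python
import random
from typing import List, Tuple


def obs_values(posteriors: List[Tuple[float, float]], rng: random.Random) -> List[float]:
    """Optimistic values Q_{t,a} = max(Q^Th_{t,a}, posterior mean) for Beta posteriors."""
    values = []
    for alpha, beta in posteriors:
        q_th = rng.betavariate(alpha, beta)   # posterior sample Q^Th_{t,a}
        f_hat = alpha / (alpha + beta)        # exploitative value (posterior mean)
        values.append(max(q_th, f_hat))       # never below the exploitative value
    return values


def obs_select(posteriors: List[Tuple[float, float]], rng: random.Random) -> int:
    """One OBS step: pick uniformly among actions with maximal optimistic value."""
    values = obs_values(posteriors, rng)
    best = max(values)
    ties = [i for i, q in enumerate(values) if q == best]
    return rng.choice(ties)


rng = random.Random(0)
posteriors = [(2.0, 3.0), (5.0, 1.0)]  # hypothetical posteriors for two actions
print(obs_values(posteriors, rng), obs_select(posteriors, rng))
```

Since the optimistic value of an action is at least its posterior mean, uncertainty can only raise an action's score relative to acting greedily, in contrast with LTS where a sample may fall below the mean.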




















3. Analysis 


Theoretical properties of the LTS and OBS algorithms are analysed in this section. In particular, 
we focus on proving convergence in the sense of (2) under mild assumptions on the posterior dis- 
tributions and expectations used. Regret analysis would provide useful insight into the finite time 
properties of the LTS and OBS algorithms. However, we consider the problem in a general setting
and impose only weak constraints on the nature of the posterior distributions used to sample action 
values, making the type of regret analysis common for UCB methods difficult, but allowing the 
convergence result to hold for a wide class of bandit settings and posterior approximations. 


3.1 LTS Algorithm Analysis 


We begin our convergence analysis by showing that the LTS algorithm explores all actions infinitely 
often, thus allowing a regression procedure to estimate all the functions fa. In order to do this we 
need to make some assumptions. 

To guarantee infinite exploration, it is desirable that the posterior distributions, $p_a(\cdot \mid \cdot, \cdot)$,
generating the LTS samples are supported on $(\inf \mathcal{S}, \sup \mathcal{S})$, a reasonable assumption in many cases.
We make the weaker assumption that each sample can be greater than (or less than) any value in
$(\inf \mathcal{S}, \sup \mathcal{S})$ with positive probability. For instance, this assumption is satisfied by any distribution
supported on $(\inf \mathcal{S}, \inf \mathcal{S} + \delta_1) \cup (\sup \mathcal{S} - \delta_2, \sup \mathcal{S})$ for $\delta_1, \delta_2 > 0$.

It is also desirable that the posterior distributions remain fixed in periods of time in which
the associated action is not selected, also a reasonable assumption if inference is independent for
different actions. We make the weaker assumption that, in such periods of time, a lower bound exists
for the probability that the LTS sample is above (or below) any value in $(\inf \mathcal{S}, \sup \mathcal{S})$. Formally, we
make the following assumption:


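Assumption 1 Let $a \in \mathcal{A}$ be an arbitrary action, let $T$ be an arbitrary time, let $I_T$ be an arbitrary
history to time $T$, and let $M \in (\inf \mathcal{S}, \sup \mathcal{S})$. There exists an $\varepsilon > 0$ depending on $a$, $T$, $I_T$ and $M$
such that for all $t > T$, all histories

$$I_t = I_T \cup \{x_T, \ldots, x_{t-1}, a_T, \ldots, a_{t-1}, r_T, \ldots, r_{t-1}\}$$

such that $a_s \neq a$ for $s \in \{T, \ldots, t-1\}$, and all $x_t \in \mathcal{X}$

$$P(Q^{\mathrm{Th}}_{t,a} > M \mid I_t, x_t) > \varepsilon$$

and

$$P(Q^{\mathrm{Th}}_{t,a} < M \mid I_t, x_t) > \varepsilon.$$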


Along with Assumption 1, we also assume that the posterior distributions concentrate on
functions of the regressor bounded away from $\sup \mathcal{S}$ as their associated actions are selected infinitely
often. Formally, we assume that:


Assumption 2 For each action $a \in \mathcal{A}$, there exists a function $g_a : \mathcal{X} \to (\inf \mathcal{S}, \sup \mathcal{S})$ such that
(i) $[Q^{\mathrm{Th}}_{t,a} - g_a(x_t)] \xrightarrow{P} 0$ as $n_{t,a} \to \infty$,
(ii) $\sup_{x \in \mathcal{X}} g_a(x) < \sup \mathcal{S}$.


We do not take $g_a = f_a$ since this allows us to prove infinite exploration even when our regression
framework does not support the true functions (e.g., when $I_0$ supports only linear functions, but the
true $f_a$ are actually non-linear functions). Furthermore, the second condition, when combined with
Assumption 1, ensures that over periods in which action $a$ is not selected there is a constant lower
bound on the probability that either the LTS or OBS algorithms sample a value greater than any
$g_a(x)$.

Although there is an apparent tension between Assumption 1 and Assumption 2(i), note that 
Assumption 1 applies to the support of the posterior distributions for periods in which associated 
actions are not selected, whereas Assumption 2(i) applies to the limits of the posterior distributions
as their associated actions are selected infinitely often. 

Lemma 2 shows that, if Assumptions 1 and 2 hold, then the proposed algorithm does guarantee
infinite exploration. The lemma is important as it can be combined with Assumption 2 to imply
that, for all $a \in \mathcal{A}$,

$$[Q^{\mathrm{Th}}_{t,a} - g_a(x_t)] \xrightarrow{P} 0 \quad \text{as } t \to \infty$$

since $\forall a \in \mathcal{A}$, $n_{t,a} \to \infty$ as $t \to \infty$. The proof of Lemma 2 relies on the following lemma (Corollary
5.29 of Breiman, 1992):


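Lemma 1 (Extended Borel-Cantelli Lemma). Let $I_t$ be an increasing sequence of $\sigma$-fields and let
$V_t$ be $I_{t+1}$-measurable. Then

$$\left\{\omega : \sum_{t=0}^{\infty} P(V_t \mid I_t) = \infty\right\} = \{\omega : \omega \in V_t \text{ infinitely often}\}$$

holds with probability 1.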


Lemma 2 If Assumptions 1 and 2 hold, then the LTS algorithm exhibits infinite exploration with
probability 1, that is,

$$P\left(\bigcap_{a \in \mathcal{A}} \{n_{t,a} \to \infty \text{ as } t \to \infty\}\right) = 1.$$


Proof Fix some arbitrary k € {2,...,A}. Assume without loss of generality that actions in Af = 
{k,...,A} are selected infinitely often and actions in 25^ = {1,...,k— 1} are selected finitely often. 
By Assumption 2 and the infinite exploration of actions in 4'™, we have that for all actions ai™ € 
AÍ there exists a function gar : X — (inf S,supS) such that 


P 
[Oar — gat (x4 )] — 0ast — o. 


Therefore, for fixed 5 > 0, there exists a finite random time, Ts, that is the earliest time in J such 
that for all actions af € 4f we have 


P (|Q hint = Bint (x; )| « ò 





L, x,t > Ts) 21— 8. (3) 


Note that, by Assumption 2, we can choose 6 to be small enough that such that for all actions a € A 
and regressors x € X, 
ga(x) +6 < supS. (4) 


Since all actions in f" are selected finitely often, there exists some finite random time Ty that 
is the earliest time in T such that no action in A® is selected after Ty. Let T = max(T5, Tp}. From 
(4) and Assumption 1 we have that for each a" € 45" ] there exists an £5 > 0 such that 





P(o < max ga (xr) TO L, x,t > r) > Efn, (5) 
, a 
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and also that there exists an €; > 0 such that 


(on Smee. (jdn 
d acA 





kar» T) >£. (6) 
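Define the events:

$$G^{\delta}_{t,a} = \left\{Q^{\mathrm{Th}}_{t,a} > \max_{a' \in \mathcal{A}} g_{a'}(x_t) + \delta\right\},$$

$$L^{\delta}_{t,a} = \left\{Q^{\mathrm{Th}}_{t,a} < \max_{a' \in \mathcal{A}} g_{a'}(x_t) + \delta\right\},$$

$$D^{\delta}_{t,a} = \left\{\left|Q^{\mathrm{Th}}_{t,a} - g_a(x_t)\right| < \delta\right\}.$$

Then the LTS action selection rule implies that

$$G^{\delta}_{t,1} \cap \left(\bigcap_{a^{\mathrm{fin}} \in \mathcal{A}^{\mathrm{fin}} \setminus \{1\}} L^{\delta}_{t,a^{\mathrm{fin}}}\right) \cap \left(\bigcap_{a^{\mathrm{inf}} \in \mathcal{A}^{\mathrm{inf}}} D^{\delta}_{t,a^{\mathrm{inf}}}\right) \subset \{a_t = 1\}$$

so that

$$\mathbb{P}(a_t = 1 \mid \mathcal{I}_t, x_t, t > T) \ge \mathbb{P}\left(G^{\delta}_{t,1} \cap \left(\bigcap_{a^{\mathrm{fin}} \in \mathcal{A}^{\mathrm{fin}} \setminus \{1\}} L^{\delta}_{t,a^{\mathrm{fin}}}\right) \cap \left(\bigcap_{a^{\mathrm{inf}} \in \mathcal{A}^{\mathrm{inf}}} D^{\delta}_{t,a^{\mathrm{inf}}}\right) \,\middle|\, \mathcal{I}_t, x_t, t > T\right). \quad (7)$$

The set $\{G^{\delta}_{t,1}, L^{\delta}_{t,a}, D^{\delta}_{t,b} : a = 2,\ldots,k-1,\ b = k,\ldots,A\}$ is a conditionally independent set of events given $\mathcal{I}_t$ and $x_t$. Therefore, by (3), (5) and (6), we have

$$\mathbb{P}\left(G^{\delta}_{t,1} \cap \left(\bigcap_{a^{\mathrm{fin}} \in \mathcal{A}^{\mathrm{fin}} \setminus \{1\}} L^{\delta}_{t,a^{\mathrm{fin}}}\right) \cap \left(\bigcap_{a^{\mathrm{inf}} \in \mathcal{A}^{\mathrm{inf}}} D^{\delta}_{t,a^{\mathrm{inf}}}\right) \,\middle|\, \mathcal{I}_t, x_t, t > T\right) > \varepsilon^{k-1}(1-\delta)^{A-k+1}, \quad (8)$$

where $\varepsilon = \min_{a^{\mathrm{fin}} \in \mathcal{A}^{\mathrm{fin}}} \varepsilon_{a^{\mathrm{fin}}}$. Combining (7) and (8), it follows that

$$\mathbb{P}(a_t = 1 \mid \mathcal{I}_t, x_t, t > T) > \varepsilon^{k-1}(1-\delta)^{A-k+1}$$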


so that

$$\sum_{t \in \mathcal{T}} \mathbb{P}(a_t = 1 \mid \mathcal{I}_t, x_t) \ge \sum_{t=T+1}^{\infty} \mathbb{P}(a_t = 1 \mid \mathcal{I}_t, x_t) = \sum_{t=T+1}^{\infty} \mathbb{P}(a_t = 1 \mid \mathcal{I}_t, x_t, t > T) > \sum_{t=T+1}^{\infty} \varepsilon^{k-1}(1-\delta)^{A-k+1} = \infty$$

since $T$ is almost surely finite. Hence, by Lemma 1, $\{a_t = 1\}$ occurs infinitely often almost surely, contradicting the assumption that $1 \in \mathcal{A}^{\mathrm{fin}}$. Since action 1 was chosen arbitrarily from the set $\mathcal{A}^{\mathrm{fin}}$, any action in $\mathcal{A}^{\mathrm{fin}}$ would cause a contradiction. Therefore, $\mathcal{A}^{\mathrm{fin}} = \emptyset$, that is, every action is selected infinitely often almost surely. □


When we return to the notion of exploitative value estimates $\hat{f}_{t,a}(x_t)$, and hence the concept of a greedy action, we wish to ascertain whether the algorithm is greedy in the limit. Assumption 2 only implies that the sum of exploitative and exploratory values tends to a particular function of the regressor and not that the exploratory values tend to zero. Although a minor point, the infinite exploration given by Assumptions 1 and 2 needs to be complemented with an assumption that the exploitative value estimates converge to the same limit as the sampled values $Q^{\mathrm{Th}}_{t,a}$ in order to prove that the policy generated by the LTS algorithm is GLIE. This assumption is not used in proving that the LTS algorithm generates policies satisfying convergence criterion (2) but is used for the equivalent proof for the OBS algorithm (see §3.2).


Assumption 3 For all actions $a \in \mathcal{A}$

$$\left[\hat{f}_{t,a}(x_t) - g_a(x_t)\right] \xrightarrow{P} 0 \text{ as } n_{t,a} \to \infty$$

for $g_a$ defined as in Assumption 2.

Lemma 3 If Assumptions 1, 2 and 3 hold, then the LTS algorithm policy is GLIE.


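Proof For any $a \in \mathcal{A}$, since
$$\tilde{f}^{\mathrm{Th}}_{t,a}(x_t) = Q^{\mathrm{Th}}_{t,a} - \hat{f}_{t,a}(x_t),$$
Assumptions 2 and 3 give
$$\tilde{f}^{\mathrm{Th}}_{t,a}(x_t) \xrightarrow{P} 0 \text{ as } n_{t,a} \to \infty. \quad (9)$$

Since Assumptions 1 and 2 are satisfied, infinite exploration is guaranteed by Lemma 2. This infinite exploration and (9) imply that $\forall a \in \mathcal{A}$

$$\tilde{f}^{\mathrm{Th}}_{t,a}(x_t) \xrightarrow{P} 0 \text{ as } t \to \infty. \quad (10)$$

Let us denote the set

$$\mathcal{A}^*_t = \operatorname*{argmax}_{a \in \mathcal{A}} \hat{f}_{t,a}(x_t).$$

By splitting value samples into exploitative and exploratory components we have

$$\mathbb{P}(a_t \in \mathcal{A}^*_t \mid \mathcal{I}_t, x_t) = \mathbb{P}\left(\max_{a \in \mathcal{A}^*_t} Q^{\mathrm{Th}}_{t,a} > \max_{a \in \mathcal{A} \setminus \mathcal{A}^*_t} Q^{\mathrm{Th}}_{t,a} \,\middle|\, \mathcal{I}_t, x_t\right)$$

$$\ge \mathbb{P}\left(\max_{a \in \mathcal{A}^*_t} \hat{f}_{t,a}(x_t) + \min_{a \in \mathcal{A}^*_t} \tilde{f}^{\mathrm{Th}}_{t,a}(x_t) > \max_{a \in \mathcal{A} \setminus \mathcal{A}^*_t} \hat{f}_{t,a}(x_t) + \max_{a \in \mathcal{A} \setminus \mathcal{A}^*_t} \tilde{f}^{\mathrm{Th}}_{t,a}(x_t) \,\middle|\, \mathcal{I}_t, x_t\right)$$

$$\ge \mathbb{P}\left(\max_{a \in \mathcal{A}^*_t} \hat{f}_{t,a}(x_t) - \max_{a \in \mathcal{A} \setminus \mathcal{A}^*_t} \hat{f}_{t,a}(x_t) > 2 \max_{a \in \mathcal{A}} \left|\tilde{f}^{\mathrm{Th}}_{t,a}(x_t)\right| \,\middle|\, \mathcal{I}_t, x_t\right)$$

$$\to 1 \text{ as } t \to \infty,$$

since the right hand side of the last inequality converges in probability to 0 by (10) and

$$\max_{a \in \mathcal{A}^*_t} \hat{f}_{t,a}(x_t) > \max_{a \in \mathcal{A} \setminus \mathcal{A}^*_t} \hat{f}_{t,a}(x_t)$$

by definition of $\mathcal{A}^*_t$. Hence, the action selection is greedy in the limit. Lemma 2 ensures infinite exploration, so the policy is GLIE. □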


We have shown we can achieve a GLIE policy even when we do not have consistent regression. However, to ensure the convergence condition (2) is satisfied we need to assume consistency, that is, that the functions $g_a$ (to which the $Q^{\mathrm{Th}}_{t,a}$ converge) are actually the true functions $f_a$.


Assumption 4 For all actions $a \in \mathcal{A}$ and regressors $x \in \mathcal{X}$,
$$g_a(x) = f_a(x).$$


The following theorem is the main convergence result for the LTS algorithm. Its proof uses the fact that, under the specified assumptions, Lemma 2 implies that, for all actions $a \in \mathcal{A}$,

$$\left[Q^{\mathrm{Th}}_{t,a} - f_a(x_t)\right] \xrightarrow{P} 0 \text{ as } t \to \infty.$$

We then use a coupling argument (dealing with the dependence in the action selection sequence) to prove that the LTS algorithm policy satisfies convergence criterion (2).


Theorem 1 If Assumptions 1, 2 and 4 hold, then the LTS algorithm will produce a policy satisfying convergence criterion (2).


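Proof Recall that the optimal expected reward function is defined by $f^*(x) = \max_{a \in \mathcal{A}} f_a(x)$. Fix some arbitrary $\delta > 0$. Denote the event

$$E^{\delta}_t = \left\{f^*(x_t) - f_{a_t}(x_t) < 2\delta\right\}$$

so that $E^{\delta}_t$ is the event that the true expected reward for the action selected at time $t$ is within $2\delta$ of the optimal expected reward at time $t$.

The first part of the proof consists of showing that
$$\mathbb{P}(E^{\delta}_t \mid \mathcal{I}_t, x_t) \xrightarrow{a.s.} 1 \text{ as } t \to \infty.$$
From Assumptions 2 and 4, and the infinite exploration guaranteed by Lemma 2, $\forall a \in \mathcal{A}$
$$\left[Q^{\mathrm{Th}}_{t,a} - f_a(x_t)\right] \xrightarrow{P} 0 \text{ as } t \to \infty.$$

Therefore there exists a finite random time, $T_\delta$, that is the earliest time in $\mathcal{T}$ such that $\forall a \in \mathcal{A}$

$$\mathbb{P}\left(\left|Q^{\mathrm{Th}}_{t,a} - f_a(x_t)\right| < \delta \,\middle|\, \mathcal{I}_t, x_t, t > T_\delta\right) > 1 - \delta \quad (11)$$

so that, after $T_\delta$, all sampled $Q^{\mathrm{Th}}_{t,a}$ values are within $\delta$ of the true values with high probability.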
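Define the events

$$B^{\delta}_{t,a} = \left\{\left|Q^{\mathrm{Th}}_{t,a} - f_a(x_t)\right| < \delta\right\}.$$

Then $\{B^{\delta}_{t,a} : a \in \mathcal{A}\}$ is a conditionally independent set of events given $\mathcal{I}_t$ and $x_t$, so that

$$\mathbb{P}\left(\bigcap_{a \in \mathcal{A}} B^{\delta}_{t,a} \,\middle|\, \mathcal{I}_t, x_t, t > T_\delta\right) = \prod_{a \in \mathcal{A}} \mathbb{P}\left(B^{\delta}_{t,a} \mid \mathcal{I}_t, x_t, t > T_\delta\right) > (1-\delta)^A \quad (12)$$

using inequality (11).

Note that, for any $a^*_t \in \operatorname{argmax}_{a \in \mathcal{A}} f_a(x_t)$,

$$\bigcap_{a \in \mathcal{A}} B^{\delta}_{t,a} \subset \left\{Q^{\mathrm{Th}}_{t,a^*_t} > f^*(x_t) - \delta\right\} \quad (13)$$

and, for any $a' \in \{a \in \mathcal{A} : f^*(x_t) - f_a(x_t) \ge 2\delta\}$,

$$\bigcap_{a \in \mathcal{A}} B^{\delta}_{t,a} \subset \left\{Q^{\mathrm{Th}}_{t,a'} < f^*(x_t) - \delta\right\}. \quad (14)$$

Since $\operatorname{argmax}_{a \in \mathcal{A}} f_a(x_t)$ is non-empty and the action selection rule is greedy on the $Q^{\mathrm{Th}}_{t,a}$, statements (13) and (14) give

$$\bigcap_{a \in \mathcal{A}} B^{\delta}_{t,a} \subset \left\{f^*(x_t) - f_{a_t}(x_t) < 2\delta\right\} = E^{\delta}_t$$

and so

$$\mathbb{P}\left(\bigcap_{a \in \mathcal{A}} B^{\delta}_{t,a} \,\middle|\, \mathcal{I}_t, x_t\right) \le \mathbb{P}(E^{\delta}_t \mid \mathcal{I}_t, x_t). \quad (15)$$

Inequalities (12) and (15) imply that
$$\mathbb{P}(E^{\delta}_t \mid \mathcal{I}_t, x_t, t > T_\delta) > (1-\delta)^A.$$
The condition above holds for arbitrarily small $\delta$ so that $\forall x \in \mathcal{X}$
$$\mathbb{P}(E^{\delta}_t \mid \mathcal{I}_t, x_t) \xrightarrow{a.s.} 1 \text{ as } t \to \infty. \quad (16)$$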

This concludes the first part of the proof. We have shown that the probability that the action selected at time $t$ has a true expected reward that is within $2\delta$ of that of the action with the highest true expected reward at time $t$ tends to 1 as $t \to \infty$. We now face the difficulty that the strong law of large numbers cannot be used directly to establish a lower bound on $\lim_{t \to \infty} \frac{1}{t} \sum_{s=1}^{t} f_{a_s}(x_s)$ since the expected reward sequence $(f_{a_s}(x_s))_{s \in \mathcal{T}}$ is a sequence of dependent random variables.

The result may be proved using a coupling argument. We will construct an independent sequence of actions $b_s$ that are coupled with $a_s$, but for which we can apply the strong law of large numbers to $f_{b_s}(x_s)$. By relating the expected reward for playing the $b_s$ sequence to that of the $a_s$ sequence we will show that the $a_s$ sequence satisfies the optimality condition (2).

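Fix some arbitrary $\varepsilon > 0$, define the sets

$$\mathcal{A}^{\varepsilon}_t = \left\{a \in \mathcal{A} : f^*(x_t) - f_a(x_t) < 2\varepsilon\right\},$$

and let $U_1, U_2, \ldots$ be a sequence of independent and identically distributed $U[0,1]$ random variables. The construction of $E^{\varepsilon}_t$ and $\mathcal{A}^{\varepsilon}_t$ implies that $E^{\varepsilon}_t = \{a_t \in \mathcal{A}^{\varepsilon}_t\}$. So, by conditioning on the event $\{a_s \in \mathcal{A}^{\varepsilon}_s\}$ and using the LTS action selection rule, it follows that $a_s$ can be expressed as

$$a_s = \begin{cases} \operatorname{argmax}_{a \in \mathcal{A}^{\varepsilon}_s} Q^{\mathrm{Th}}_{s,a} & \text{if } U_s < \mathbb{P}(E^{\varepsilon}_s \mid \mathcal{I}_s, x_s), \\ \operatorname{argmax}_{a \in \mathcal{A} \setminus \mathcal{A}^{\varepsilon}_s} Q^{\mathrm{Th}}_{s,a} & \text{if } U_s > \mathbb{P}(E^{\varepsilon}_s \mid \mathcal{I}_s, x_s), \end{cases}$$

with ties resolved using uniform sampling.

We similarly define $b_s$ based on the $U_s$ as

$$b_s = \begin{cases} \operatorname{argmin}_{a \in \mathcal{A}^{\varepsilon}_s} f_a(x_s) & \text{if } U_s < 1 - \varepsilon, \\ \operatorname{argmin}_{a \in \mathcal{A}} f_a(x_s) & \text{if } U_s > 1 - \varepsilon, \end{cases}$$

again, with ties resolved using uniform sampling. Note that, since the $U_s$ and $x_s$ are independent and identically distributed, the $b_s$ are independent and identically distributed, and so is the sequence $f_{b_s}(x_s)$.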


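Note that by (16), applied with $\delta = \varepsilon$, there exists a finite random time
$$S_\varepsilon = \sup\left\{t < \infty : \mathbb{P}(E^{\varepsilon}_t \mid \mathcal{I}_t, x_t) < 1 - \varepsilon\right\}.$$
By considering the definition of $S_\varepsilon$, it follows that

$$\{s > S_\varepsilon\} \cap \{U_s < 1 - \varepsilon\} \subset \left\{U_s < \mathbb{P}(E^{\varepsilon}_s \mid \mathcal{I}_s, x_s)\right\} \subset \left\{a_s \in \operatorname*{argmax}_{a \in \mathcal{A}^{\varepsilon}_s} Q^{\mathrm{Th}}_{s,a}\right\} \subset \left\{a_s \in \mathcal{A}^{\varepsilon}_s\right\} \subset \left\{f_{a_s}(x_s) \ge \min_{b \in \mathcal{A}^{\varepsilon}_s} f_b(x_s)\right\}. \quad (17)$$

Also, it is the case that

$$\{U_s < 1 - \varepsilon\} \subset \left\{b_s \in \operatorname*{argmin}_{b \in \mathcal{A}^{\varepsilon}_s} f_b(x_s)\right\} \subset \left\{f_{b_s}(x_s) = \min_{b \in \mathcal{A}^{\varepsilon}_s} f_b(x_s)\right\}. \quad (18)$$

Combining (17) and (18), we have that

$$\{s > S_\varepsilon\} \cap \{U_s < 1 - \varepsilon\} \subset \left\{f_{a_s}(x_s) \ge f_{b_s}(x_s)\right\}. \quad (19)$$

Note also that

$$\{U_s > 1 - \varepsilon\} \subset \left\{f_{b_s}(x_s) = \min_{a \in \mathcal{A}} f_a(x_s) \le f_{a_s}(x_s)\right\}. \quad (20)$$

It follows from (19), (20) and the definition of $f^*$ that

$$\{s > S_\varepsilon\} \subset \left\{f^*(x_s) \ge f_{a_s}(x_s) \ge f_{b_s}(x_s)\right\}$$

and so

$$\frac{1}{t} \sum_{s=S_\varepsilon}^{t} f^*(x_s) \ge \frac{1}{t} \sum_{s=S_\varepsilon}^{t} f_{a_s}(x_s) \ge \frac{1}{t} \sum_{s=S_\varepsilon}^{t} f_{b_s}(x_s). \quad (21)$$

We will now use inequality (21) to prove the result. The definition of $b_s$ implies that

$$\{U_s < 1 - \varepsilon\} \subset \{b_s \in \mathcal{A}^{\varepsilon}_s\}.$$

By considering the definition of $\mathcal{A}^{\varepsilon}_s$, it follows that

$$\{U_s < 1 - \varepsilon\} \subset \left\{f_{b_s}(x_s) > f^*(x_s) - 2\varepsilon\right\}. \quad (22)$$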


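Since

• $S_\varepsilon$ is finite
• the $U_s$ are independent and identically distributed
• the $f_{b_s}(x_s)$ are independent and identically distributed

we can use the strong law of large numbers and (22) to get

$$\lim_{t \to \infty} \left[\frac{1}{t} \sum_{s=S_\varepsilon}^{t} f_{b_s}(x_s)\right] = \mathbb{E}_{U \times X} f_{b_s}(x_s)$$

$$= \mathbb{P}(U_s < 1 - \varepsilon)\, \mathbb{E}_X\left[f_{b_s}(x_s) \mid U_s < 1 - \varepsilon\right] + \mathbb{P}(U_s > 1 - \varepsilon)\, \mathbb{E}_X\left[f_{b_s}(x_s) \mid U_s > 1 - \varepsilon\right]$$

$$> (1 - \varepsilon)\left(\mathbb{E}_X f^*(\cdot) - 2\varepsilon\right) + \varepsilon\, \mathbb{E}_X\left[f_{b_s}(x_s) \mid U_s > 1 - \varepsilon\right], \quad (23)$$

where $\mathbb{E}_{U \times X}$ denotes expectation taken with respect to the joint distribution of $U_t$ and $x_t$ and $\mathbb{E}_X$ denotes expectation taken with respect to the distribution of $x_t$ (note that both distributions are the same for all values of $t$).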

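By the strong law of large numbers, we get

$$\lim_{t \to \infty} \left[\frac{1}{t} \sum_{s=S_\varepsilon}^{t} f^*(x_s)\right] = \mathbb{E}_X f^*(\cdot). \quad (24)$$

Since (21), (23) and (24) hold, we have that

$$\mathbb{E}_X f^*(\cdot) \ge \lim_{t \to \infty} \frac{1}{t} \sum_{s=S_\varepsilon}^{t} f_{a_s}(x_s) \ge \lim_{t \to \infty} \frac{1}{t} \sum_{s=S_\varepsilon}^{t} f_{b_s}(x_s) > (1 - \varepsilon)\left(\mathbb{E}_X f^*(\cdot) - 2\varepsilon\right) + \varepsilon\, \mathbb{E}_X\left[f_{b_s}(x_s) \mid U_s > 1 - \varepsilon\right].$$

This holds for arbitrarily small $\varepsilon$, hence

$$\lim_{t \to \infty} \left[\frac{1}{t} \sum_{s=S_\varepsilon}^{t} f_{a_s}(x_s)\right] = \mathbb{E}_X f^*(\cdot). \quad (25)$$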
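It is the case that

$$\lim_{t \to \infty} \frac{1}{t} \sum_{s=1}^{t} f_{a_s}(x_s) = \lim_{t \to \infty} \frac{1}{t} \sum_{s=1}^{S_\varepsilon - 1} f_{a_s}(x_s) + \lim_{t \to \infty} \frac{1}{t} \sum_{s=S_\varepsilon}^{t} f_{a_s}(x_s) = 0 + \lim_{t \to \infty} \left[\frac{1}{t} \sum_{s=S_\varepsilon}^{t} f_{a_s}(x_s)\right] \quad (26)$$

as $t \to \infty$ since $S_\varepsilon$ is finite and $f_{a_s}(x_s) \le \sup_{x \in \mathcal{X}} f^*(x) < \infty$.

Since both (25) and (26) hold, it is true that

$$\lim_{t \to \infty} \frac{1}{t} \sum_{s=1}^{t} f_{a_s}(x_s) = \mathbb{E}_X f^*(\cdot) = \lim_{t \to \infty} \left[\frac{1}{t} \sum_{s=1}^{t} f^*(x_s)\right].$$

Hence

$$\frac{\sum_{s=1}^{t} f_{a_s}(x_s)}{\sum_{s=1}^{t} f^*(x_s)} \xrightarrow{a.s.} 1 \text{ as } t \to \infty.$$
□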


3.2 OBS Algorithm Analysis 


We analyse the OBS algorithm in a similar way to the LTS algorithm. In order to prove infinite exploration for the OBS algorithm, we must make an additional assumption on the exploitative value estimates. We assume that exploitative values are less than $\sup \mathcal{S}$ by a constant for all regressor values during periods of time in which their associated actions are not selected. This allows us to make statements similar to inequality (5) in the proof of Lemma 2, but relating to OBS samples rather than LTS samples.


Assumption 5 Let $a \in \mathcal{A}$ be an arbitrary action, let $T$ be an arbitrary time, and let $\mathcal{I}_T$ be an arbitrary history to time $T$. There exists a $\delta > 0$ depending on $a$, $T$, and $\mathcal{I}_T$ such that for all $t > T$, all histories $\mathcal{I}_t = \mathcal{I}_T \cup \{x_T, \ldots, x_{t-1}, r_T, \ldots, r_{t-1}, a_T, \ldots, a_{t-1}\}$ such that $a_s \ne a$ for $s \in \{T, \ldots, t-1\}$, and all $x \in \mathcal{X}$,

$$\sup \mathcal{S} - \hat{f}_{t,a}(x) > \delta.$$


We now show that the OBS algorithm explores all actions infinitely often. Assumptions 2 and 3 imply that, for any action $a \in \mathcal{A}$,

$$\left[Q_{t,a} - g_a(x_t)\right] \xrightarrow{P} 0 \text{ as } n_{t,a} \to \infty$$

so that OBS samples associated with actions assumed to be selected infinitely often can be treated in the same way as LTS samples are in the proof of Lemma 2. The only slight difference in the proof comes in the treatment of samples associated with actions assumed to be selected finitely often, although Assumption 5 ensures that the logic is similar.


Lemma 4 If Assumptions 1, 2, 3 and 5 hold, then the OBS algorithm exhibits infinite exploration with probability 1.


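Proof Since $Q_{t,a} = \max(Q^{\mathrm{Th}}_{t,a}, \hat{f}_{t,a}(x_t))$, Assumptions 2 and 3 give that $\forall a \in \mathcal{A}^{\mathrm{inf}}$

$$\left[Q_{t,a} - g_a(x_t)\right] \xrightarrow{P} 0 \text{ as } t \to \infty.$$

Let $T$ and $\delta$ be defined as in Lemma 2 (with the $Q^{\mathrm{Th}}_{t,a}$ replaced by $Q_{t,a}$). In the proof of Lemma 2, $g^*(x_t) := \max_{a \in \mathcal{A}} g_a(x_t) + \delta$ is used as a target for samples associated with actions $a^{\mathrm{fin}} \in \mathcal{A}^{\mathrm{fin}} \setminus \{1\}$ to fall below and the sample associated with action 1 to fall above. The assumptions do not rule out the event that there exists an action $a$ in $\mathcal{A}^{\mathrm{fin}} \setminus \{1\}$ such that, for all $t > T$, $\hat{f}_{t,a}(x_t) > g^*(x_t)$, thus making it impossible for $Q_{t,a}$ to fall below $g^*(x_t)$. However, Assumption 5 can be used to imply that there exists a $\delta_1 > 0$ such that $\forall a^{\mathrm{fin}} \in \mathcal{A}^{\mathrm{fin}}$ and $\forall t > T$

$$\hat{f}_{t,a^{\mathrm{fin}}}(x_t) < \sup \mathcal{S} - \delta_1. \quad (27)$$

Assumption 1 and inequality (27) then imply that, for all actions $a^{\mathrm{fin}} \in \mathcal{A}^{\mathrm{fin}} \setminus \{1\}$, there exists an $\varepsilon_{a^{\mathrm{fin}}} > 0$ such that
$$\mathbb{P}\left(Q_{t,a^{\mathrm{fin}}} < \max(g^*(x_t), \sup \mathcal{S} - \delta_1) \,\middle|\, \mathcal{I}_t, x_t, t > T\right) > \varepsilon_{a^{\mathrm{fin}}}$$

and also that there exists an $\varepsilon_1 > 0$ such that
$$\mathbb{P}\left(Q_{t,1} > \max(g^*(x_t), \sup \mathcal{S} - \delta_1) \,\middle|\, \mathcal{I}_t, x_t, t > T\right) > \varepsilon_1.$$

The proof then follows in a similar manner to that of Lemma 2, with the $Q^{\mathrm{Th}}_{t,a}$ replaced by $Q_{t,a}$. □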


In the case of the LTS algorithm, it is not necessary for the generated policy to be GLIE for Theorem 1 to hold. Assumptions are only made on total action value estimates, that is, the sum of exploitative and exploratory value, and it is not necessary that the exploratory value converges to zero. Exploitative value estimates are not used explicitly for the LTS algorithm and Lemma 3 is included in this work for completeness. In the case of the OBS algorithm, it is important that Assumption 3 holds so that the policy is GLIE, since exploitative values are used explicitly. The total action value can be equal to the exploitative value estimate, so it is important that the exploitative estimate converges to the same value as the LTS samples. This holds if the posterior expectation is used, as we suggest. However, our framework allows for the use of any functions of the regressor satisfying Assumptions 3 and 5 when implementing the OBS algorithm, and the convergence result will still hold.


Lemma 5 If Assumptions 1, 2, 3 and 5 hold, then the OBS algorithm policy is GLIE.


Proof The proof is similar to that of Lemma 3, replacing $\tilde{f}^{\mathrm{Th}}_{t,a}$ with $\tilde{f}_{t,a}$, replacing $Q^{\mathrm{Th}}_{t,a}$ with $Q_{t,a}$ and using the fact that
$$\tilde{f}_{t,a}(x_t) = \max\left(0, \tilde{f}^{\mathrm{Th}}_{t,a}(x_t)\right).$$
□

Under Assumptions 1–5, we have that the LTS samples, $Q^{\mathrm{Th}}_{t,a}$, and the exploitative values, $\hat{f}_{t,a}(x_t)$, are consistent estimators of the true expected rewards, $f_a(x_t)$, and that infinite exploration is guaranteed by Lemma 4. Therefore, we have that the OBS samples, $Q_{t,a}$, converge in probability to the true expected rewards, $f_a(x_t)$, as $t \to \infty$. We can therefore prove that the OBS algorithm satisfies convergence criterion (2) using a similar method to that used for the proof of Theorem 1.


Theorem 2 If Assumptions 1–5 hold, then the OBS algorithm will produce a policy satisfying convergence criterion (2).


Proof By Assumptions 2, 3 and 4 and the infinite exploration guaranteed by Lemma 4, we have that $\forall a \in \mathcal{A}$
$$\left[Q_{t,a} - f_a(x_t)\right] \xrightarrow{P} 0 \text{ as } t \to \infty$$

since $Q_{t,a} = \max(Q^{\mathrm{Th}}_{t,a}, \hat{f}_{t,a}(x_t))$. The remainder of the proof follows as in the case of Theorem 1 (replacing $Q^{\mathrm{Th}}_{t,a}$ with $Q_{t,a}$). □

4. Case Studies


In this section, we aim to validate claims made in §1.2 regarding the short term performance of the OBS algorithm by means of simulation. We use the notion of cumulative pseudo-regret (Filippi et al., 2010) to assess the performance of an algorithm. The cumulative pseudo-regret measures the expected difference between the reward the algorithm receives and the reward that would be received if the regression functions were known in advance so that an optimal arm can be chosen on every timestep; it is a standard measure of finite-time performance of a bandit algorithm. Our definition differs slightly from that of Filippi et al. (2010) since we do not restrict attention to generalised linear bandits.


Definition 4 The cumulative (pseudo) regret, $R_T$, at time $T$ is given by

$$R_T = \sum_{t=1}^{T} \left[f^*(x_t) - f_{a_t}(x_t)\right].$$
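Definition 4 translates directly into a few lines of array code. The following is a minimal sketch, assuming the true expected rewards $f_a(x_t)$ are available as a matrix (which is only the case in simulation); the function name `cumulative_pseudo_regret` and the 0-based action indexing are illustrative choices, not part of the original text.

```python
import numpy as np

def cumulative_pseudo_regret(expected_rewards, actions):
    """Return R_1, ..., R_T as in Definition 4.

    expected_rewards[t, a] holds f_a(x_t); actions[t] holds a_t (0-indexed).
    Both inputs are assumptions of this sketch: they are known only in simulation.
    """
    expected_rewards = np.asarray(expected_rewards, dtype=float)
    actions = np.asarray(actions, dtype=int)
    optimal = expected_rewards.max(axis=1)                        # f*(x_t)
    chosen = expected_rewards[np.arange(len(actions)), actions]   # f_{a_t}(x_t)
    return np.cumsum(optimal - chosen)

# Two-armed example with constant expected rewards (0.8, 0.9) and no regressor.
f = np.tile([0.8, 0.9], (4, 1))
regret = cumulative_pseudo_regret(f, [0, 1, 1, 0])
```

Each play of the first arm adds 0.1 to the regret in this example, and each play of the second arm adds nothing.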


We compare the performance of OBS to that of LTS and various recently proposed action selection methods in simulated Bernoulli bandit and linear regression problem settings in §4.1 and §4.2 respectively. We also consider a real-world version of the problem using data that relates to personalised news article recommendation, the Yahoo! Front Page Today Module User Click Log Data Set (Yahoo! Academic Relations, 2011). Graepel et al. (2010) suggest using LTS to deal with the exploration-exploitation dilemma in a similar sponsored search advertising setting. We compare the OBS performance to that of LTS on the Yahoo! data and obtain results indicating that OBS performs better in the short term.


4.1 Bernoulli Bandit 


In the multi-armed Bernoulli bandit problem, there is no regressor present. If the agent chooses action $a$ on any timestep then a reward of 1 is received with probability $p_a$ and 0 with probability $1 - p_a$. For each action $a$, the probability $p_a$ can be estimated by considering the frequency of success observed in past selections of the action. The agent needs to explore in order to learn the probabilities of success for each action, so that the action yielding the highest expected reward can be identified. The agent needs to exploit what has been learned in order to maximise expected reward. The multi-armed Bernoulli bandit problem presents a simple example of the exploration-exploitation dilemma, and has therefore been studied extensively.


4.1.1 PROBLEM CONSIDERED 

In this case, we let the prior information, $\mathcal{I}_0$, consist of the following:
• The number of actions, $A$.
• $(\forall a \in \mathcal{A})(\forall t \in \mathcal{T})\{f_a(x_t) = p_a\}$ for $p_a \in (0,1)$ unknown.
• $(\forall a \in \mathcal{A})(\forall t \in \mathcal{T})$, $z_{t,a} = -p_a$ with probability $1 - p_a$, and $z_{t,a} = 1 - p_a$ with probability $p_a$.
• For each action $a \in \mathcal{A}$, the prior distribution of $f_a$ is Beta(1, 1) (or equivalently $U(0,1)$).

4.1.2 LTS AND OBS IMPLEMENTATION


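Let $\tilde{r}_{\tau,a}$ denote the value of the reward received on the timestep where action $a$ was picked for the $\tau$th time. For arbitrary $a \in \mathcal{A}$ define

$$s_{t,a} = \sum_{\tau=1}^{n_{t,a}} \tilde{r}_{\tau,a}.$$

Posterior expectations (using flat priors, as indicated by $\mathcal{I}_0$) can be evaluated easily, so we define exploitative value as

$$\hat{f}_{t,a} := \frac{s_{t,a} + 1}{n_{t,a} + 2}.$$

The posterior distribution of $p_a$ given $\mathcal{I}_t$ has a simple form. We sample

$$Q^{\mathrm{Th}}_{t,a} \sim \mathrm{Beta}(s_{t,a} + 1,\ n_{t,a} - s_{t,a} + 1)$$

and set

$$Q_{t,a} = \max(Q^{\mathrm{Th}}_{t,a}, \hat{f}_{t,a}).$$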
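The three quantities above can be computed together for all actions. The sketch below assumes NumPy's `Generator.beta` for the posterior draw; the function name `bernoulli_values` and the use of `argmax` (which breaks ties by lowest index rather than uniformly at random) are simplifications introduced here for illustration.

```python
import numpy as np

def bernoulli_values(successes, pulls, rng):
    """Return (f_hat, q_th, q_obs) for every action.

    successes[a] = s_{t,a}, pulls[a] = n_{t,a}. Names are illustrative.
    """
    successes = np.asarray(successes, dtype=float)
    pulls = np.asarray(pulls, dtype=float)
    f_hat = (successes + 1.0) / (pulls + 2.0)                   # posterior mean under Beta(1, 1) prior
    q_th = rng.beta(successes + 1.0, pulls - successes + 1.0)   # LTS posterior sample
    q_obs = np.maximum(q_th, f_hat)                             # OBS value: never below the posterior mean
    return f_hat, q_th, q_obs

rng = np.random.default_rng(0)
# Action 0: 3 successes in 5 pulls. Action 1: never played.
f_hat, q_th, q_obs = bernoulli_values([3, 0], [5, 0], rng)
lts_action = int(np.argmax(q_th))    # LTS plays greedily on the samples
obs_action = int(np.argmax(q_obs))   # OBS plays greedily on the optimistic values
```

Because the OBS value is the larger of the sample and the posterior mean, it can only raise an action's value relative to LTS, which is the source of the extra optimism.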


4.1.3 CONVERGENCE 


In this section, we check explicitly that Assumptions 1-5 are satisfied in this Bernoulli bandit set- 
ting, therefore proving that the LTS and OBS algorithms generate policies satisfying convergence 
criterion (2). 


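Lemma 6 The LTS total value estimate, $Q^{\mathrm{Th}}_{t,a}$, satisfies Assumption 1, for all $a \in \mathcal{A}$.


Proof Let $a \in \mathcal{A}$, $T > 0$, $\mathcal{I}_T$ and $M \in (0,1)$ be arbitrary. For any $t > T$ and $\mathcal{I}_t = \mathcal{I}_T \cup \{r_T, \ldots, r_{t-1}, a_T, \ldots, a_{t-1}\}$ with $a_s \ne a$ for $s \in \{T, \ldots, t-1\}$, the posterior distribution of $f_a$ given $\mathcal{I}_t$ will be the same as the posterior distribution of $f_a$ given $\mathcal{I}_T$ (since no further information about $f_a$ is contained in $\mathcal{I}_t$). Let

$$\varepsilon := \frac{1}{2} \min\left\{\mathbb{P}(Q^{\mathrm{Th}}_{T,a} < M \mid \mathcal{I}_T),\ \mathbb{P}(Q^{\mathrm{Th}}_{T,a} > M \mid \mathcal{I}_T)\right\}.$$

We then have that
$$\mathbb{P}(Q^{\mathrm{Th}}_{t,a} > M \mid \mathcal{I}_t) > \varepsilon$$

and

$$\mathbb{P}(Q^{\mathrm{Th}}_{t,a} < M \mid \mathcal{I}_t) > \varepsilon.$$
□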


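Lemma 7 The LTS total value estimate, $Q^{\mathrm{Th}}_{t,a}$, satisfies Assumptions 2–4, for all $a \in \mathcal{A}$.


Proof Posterior expectations are given by

$$\hat{f}_{t,a} := \frac{s_{t,a} + 1}{n_{t,a} + 2} = \frac{\sum_{\tau=1}^{n_{t,a}} \tilde{r}_{\tau,a} + 1}{n_{t,a} + 2}.$$

Using the strong law of large numbers, we then have

$$\lim_{n_{t,a} \to \infty} \hat{f}_{t,a} = \lim_{n_{t,a} \to \infty} \frac{\sum_{\tau=1}^{n_{t,a}} \tilde{r}_{\tau,a}}{n_{t,a}} = \mathbb{E}(r_t \mid a_t = a) = p_a = f_a. \quad (28)$$

Therefore, it is the case that

$$\mathbb{E}(Q^{\mathrm{Th}}_{t,a} \mid \mathcal{I}_t) = \hat{f}_{t,a} \xrightarrow{a.s.} f_a \text{ as } n_{t,a} \to \infty. \quad (29)$$

By considering the variance of the LTS samples, we get

$$\mathrm{Var}(Q^{\mathrm{Th}}_{t,a} \mid \mathcal{I}_t) = \frac{(s_{t,a} + 1)(n_{t,a} - s_{t,a} + 1)}{(n_{t,a} + 2)^2 (n_{t,a} + 3)} \le \frac{(n_{t,a} + 2)^2}{(n_{t,a} + 2)^2 (n_{t,a} + 3)} = \frac{1}{n_{t,a} + 3} \xrightarrow{a.s.} 0 \text{ as } n_{t,a} \to \infty. \quad (30)$$

From (29) and (30), we then have $\forall a \in \mathcal{A}$

$$Q^{\mathrm{Th}}_{t,a} \xrightarrow{P} f_a \text{ as } n_{t,a} \to \infty. \quad (31)$$

Note that since $f_a = p_a < 1$ for each $a \in \mathcal{A}$, and $|\mathcal{A}| < \infty$, convergence result (31) shows that Assumptions 2 and 4 hold and convergence results (31) and (28) combined show that Assumption 3 holds. □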
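The variance bound in (30) can be checked exhaustively for small sample sizes. The sketch below uses the standard variance formula for a Beta($\alpha$, $\beta$) distribution with $\alpha = s_{t,a} + 1$ and $\beta = n_{t,a} - s_{t,a} + 1$; the function name and the range of $n_{t,a}$ checked are arbitrary choices made for illustration.

```python
def beta_posterior_variance(s, n):
    """Variance of Beta(s + 1, n - s + 1), the posterior after s successes in n pulls."""
    alpha, beta = s + 1, n - s + 1
    return alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1))

# Check Var <= 1 / (n + 3) for every feasible (s, n) with n < 200.
checks = [
    beta_posterior_variance(s, n) <= 1.0 / (n + 3)
    for n in range(0, 200)
    for s in range(0, n + 1)
]
all_hold = all(checks)
```

This is a numerical sanity check only; the inequality itself follows from $(s_{t,a}+1)(n_{t,a}-s_{t,a}+1) \le (n_{t,a}+2)^2$.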


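Lemma 8 The exploitative value estimate, $\hat{f}_{t,a}$, satisfies Assumption 5, for all $a \in \mathcal{A}$.


Proof Let $a \in \mathcal{A}$, $T > 0$ and $\mathcal{I}_T$ be arbitrary. For any $t > T$ and $\mathcal{I}_t = \mathcal{I}_T \cup \{r_T, \ldots, r_{t-1}, a_T, \ldots, a_{t-1}\}$ with $a_s \ne a$ for $s \in \{T, \ldots, t-1\}$,

$$n_{t,a} = n_{T,a} \quad \text{and} \quad s_{t,a} = s_{T,a}.$$

Therefore

$$\hat{f}_{t,a} = \frac{s_{t,a} + 1}{n_{t,a} + 2} \le \frac{n_{T,a} + 1}{n_{T,a} + 2} = 1 - \frac{1}{n_{T,a} + 2} = \sup \mathcal{S} - \frac{1}{n_{T,a} + 2},$$

so that the assumption is satisfied with any $\delta < 1/(n_{T,a} + 2)$, for example $\delta = 1/(n_{T,a} + 3)$. □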


Proposition 1 Within the described Bernoulli bandit setting convergence criterion (2) is satisfied 
when the LTS or the OBS algorithm is used. 


Proof Assumptions 1–5 hold, so the proof follows directly from Theorems 1 and 2. □

4.1.4 EXPERIMENTAL RESULTS


We parameterise a Bernoulli problem of the described form with a vector of probabilities, $(p_1, \ldots, p_A)$, corresponding to the expected rewards for the actions in $\mathcal{A}$. We simulate the problem in four environments with parameters (0.8, 0.9), (0.8, 0.8, 0.8, 0.9), (0.45, 0.55) and (0.45, 0.45, 0.45, 0.55). It is well known that the variance of a Bernoulli random variable is maximised when the associated probability of success is 0.5. We choose to consider the four environments mentioned to provide 'low variance' and 'high variance' versions of the problem and to investigate the effect of increasing the number of actions.


For each problem environment, the process is run for 8000 independent trials. A time window of $\mathcal{T} = \{1, \ldots, 5000\}$ is considered on each trial. A trial consists of sampling the potential rewards $r_{t,a} \sim \mathrm{Bernoulli}(p_a)$ for each $t \in \mathcal{T}$ and $a \in \mathcal{A}$ and running all algorithms on the same set of potential rewards, whilst recording the regret incurred. We compare the performance of the LTS and OBS algorithms to that of UCB-Tuned, MOSS, UCB-V, KL-UCB and KL-UCB+ in each of the four simulated environments. The UCB-Tuned and MOSS algorithms are implemented exactly as described by Auer et al. (2002) and Audibert and Bubeck (2010) respectively (see footnote 3). The UCB-V algorithm is implemented as described by Audibert et al. (2007), with exploration function and tuning constants set to the 'natural values' suggested (see footnote 4). The KL-UCB and KL-UCB+ algorithms are implemented as described by Garivier and Cappé (2011), with constant $c = 0$, as used in their numerical experiments.
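The trial design described above, in which every algorithm sees the same pre-sampled potential rewards, can be sketched as follows. This is an illustrative reconstruction and not the authors' simulation code: the function name `run_policy`, the short horizon of 500 steps, the seeds, and the lowest-index tie-breaking in `argmax` are all assumptions made for the example.

```python
import numpy as np

def run_policy(potential_rewards, p, rng, optimistic):
    """Run LTS (optimistic=False) or OBS (optimistic=True) on fixed potential rewards.

    potential_rewards[t, a] is the reward action a would yield at time t.
    Returns the cumulative pseudo-regret after each step.
    """
    horizon, n_actions = potential_rewards.shape
    successes = np.zeros(n_actions)
    pulls = np.zeros(n_actions)
    regret = np.zeros(horizon)
    total = 0.0
    for t in range(horizon):
        values = rng.beta(successes + 1.0, pulls - successes + 1.0)  # posterior samples
        if optimistic:
            # OBS: take the larger of the sample and the posterior mean.
            values = np.maximum(values, (successes + 1.0) / (pulls + 2.0))
        a = int(np.argmax(values))
        successes[a] += potential_rewards[t, a]
        pulls[a] += 1.0
        total += p.max() - p[a]
        regret[t] = total
    return regret

p = np.array([0.8, 0.9])
reward_rng = np.random.default_rng(1)
# One shared table of potential rewards for all algorithms in this trial.
potential_rewards = (reward_rng.random((500, p.size)) < p).astype(float)
regret_lts = run_policy(potential_rewards, p, np.random.default_rng(2), optimistic=False)
regret_obs = run_policy(potential_rewards, p, np.random.default_rng(3), optimistic=True)
```

Sharing the reward table removes reward noise as a source of difference between algorithms within a trial; averaging the returned regret curves over many such trials gives an estimate of expected cumulative regret.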


The results of the simulations are summarised in Figures 1–4. The left hand plots show cumulative regret averaged over the trials. The right hand plots show boxplots indicating the distribution of final cumulative regret over trials. We consider cumulative regret averaged over trials since this provides an estimate for the expected cumulative regret, $\mathbb{E}(R_T)$, where the expectation is taken with respect to the regressor sequence and the reward and action sequences under the proposed algorithm, a much more meaningful measure than the cumulative regret incurred over any one trial. We plot the average cumulative regret on a logarithmic timescale, so that one can get an indication as to whether an algorithm has an optimal rate of regret.














We first note that, in the cases considered, the MOSS and UCB-V algorithms perform relatively poorly, despite proven regret guarantees. The left hand plots in Figures 1 and 2 indicate that the KL-UCB+ algorithm has the best performance (in terms of expected regret) for the 'low variance' problem environments, whereas Figures 3 and 4 indicate that the UCB-Tuned algorithm has the best performance in the 'high variance' problem environments. Both the OBS and LTS algorithms display highly competitive performance in all cases considered, with the OBS algorithm consistently outperforming the LTS algorithm, as predicted in Section 1.2. It is also indicated that increasing the number of actions from 2 to 4 widens this performance gap between OBS and LTS. There is no method tested that outperforms OBS in all four problems and the OBS algorithm displays performance that is never far from the leading algorithm.


3. We implement the MOSS algorithm with the time horizon known. We note that the algorithm can be run without knowledge of the horizon using the 'doubling trick' (Cesa-Bianchi and Lugosi, 2006), whereby the horizon used in the algorithm is originally set to 2 and then doubled whenever $t$ exceeds the assumed horizon. In preliminary numerical experiments, the version using knowledge of the time horizon slightly outperformed (with respect to averaged final cumulative regret) the 'doubling trick' version in all problem environments tested, so we choose to use the former in comparisons.

4. For the UCB-V algorithm, we use exploration function $\mathcal{E}_t = \log t$ and constant $c = 1/6$, in the notation of Audibert et al. (2007). In preliminary numerical experiments, this version outperformed the version used in the numerical experiments section of Audibert and Bubeck (2009) (with $c = 1$ instead) in all four problem environments tested, and so is used for comparisons.

Figure 1: Performance of various algorithms in Bernoulli bandit simulation with parameters $(p_1, p_2) = (0.8, 0.9)$. Left: Cumulative regret averaged over trials. Right: Distribution of cumulative regret at time $t = 5000$. Results based on 8000 independent trials.


The boxplots on the right hand side of Figures 1–4 indicate that LTS, OBS, UCB-Tuned and (to a lesser extent) KL-UCB+ are all 'risky' algorithms, when compared to the others. If one were risk-averse, then the KL-UCB, MOSS and UCB-V algorithms would be suitable options (see footnote 5). It is also worth noting that the regret distribution associated with the OBS algorithm seems to have a fatter upper tail than the LTS algorithm but the LTS algorithm has more variance near the median (which is higher than the OBS median in the four cases considered). A theoretical analysis of the concentration of regret for the OBS and LTS algorithms is desirable so that this can be investigated further, although we leave this to future work.


Finally, in Figure 5, we present plots of the reward ratio (2) through time, for the first 100 trials of the first experimental condition, in order to illustrate the results proved in the theoretical part of the paper. The 'almost sure' nature of the convergence of this quantity is observed, in that on some runs there is an initial period in which the ratio 'sticks' before asymptoting towards 1, whereas most runs converge quickly towards the asymptote. An identical phenomenon is observed in the other experimental conditions.





5. Note that Audibert and Bubeck (2009) give theoretical results on the concentration of the regret incurred by the UCB-V algorithm, as well as on its expectation.

Figure 2: Performance in Bernoulli bandit simulation with parameters (0.8, 0.8, 0.8, 0.9). Note that the curves for the OBS algorithm and the KL-UCB+ algorithm are virtually coincident.


4.2 Linear Regression 


In this case, we study a form of the problem in which the expected reward for each action is a linear 
function of an observable scalar regressor and the reward noise terms are normally distributed. The 
learning task becomes that of estimating both the intercept and slope coefficients for each of the 
actions, so that the action yielding the highest expected reward given the regressor can be identi- 
fied. The exploration-exploitation dilemma is inherent due to uncertainty in regression coefficient 
estimates caused by the reward noise. 


4.2.1 PROBLEM CONSIDERED 


In this case, we let the prior information, $\mathcal{I}_0$, consist of the following:

• The number of actions, A = 4.
• $(\forall t \in \mathcal{T})\{x_t \sim U(-0.5, 0.5)\}$.
• $(\forall a \in \mathcal{A})(\forall t \in \mathcal{T})\{f_a(x_t) = \beta_{1,a} + \beta_{2,a} x_t\}$ for $\beta_{1,a}, \beta_{2,a} \in \mathbb{R}$ unknown.
• $(\forall a \in \mathcal{A})(\forall t \in \mathcal{T})\{z_{t,a} \sim N(0, \sigma_a^2)\}$ for $\sigma_a \in \mathbb{R}$ unknown.
• $(\forall a \in \mathcal{A})$\{The (improper) prior distributions for $\beta_{1,a}$ and $\beta_{2,a}$ are flat over $\mathbb{R}$\}.
• $(\forall a \in \mathcal{A})$\{The (improper) prior distribution of $\sigma_a^2$ is flat over $\mathbb{R}^+$\}.

Figure 3: Performance in Bernoulli bandit simulation with parameters (0.45, 0.55). Note that the curves for the OBS algorithm and the KL-UCB+ algorithm are virtually coincident.

Figure 4: Performance in Bernoulli bandit simulation with parameters (0.45, 0.45, 0.45, 0.55).


































[Figure 5 plots: two panels, each with horizontal axis "Time, t" running from 0 to 5000; vertical axis labels not recoverable.]


Figure 5: Convergence of the ratio (2) in the first 100 Bernoulli bandit simulations with parameters $(p_1, p_2) = (0.8, 0.9)$.


4.2.2 LTS AND OBS IMPLEMENTATION 


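Denote estimators at time $t$ of the parameters $\mathbf{b}_a$ and $\sigma_a$ for $a = 1, \ldots, A$ as $\hat{\mathbf{b}}_{t,a}$ and $\hat{\sigma}_{t,a}$ respectively, where $\mathbf{b}_a = (\beta_{1,a}, \beta_{2,a})^T$. For all $a \in \mathcal{A}$, denote $\mathcal{T}_{t,a} = \{\tau \in \{1, \ldots, t-1\} : a_\tau = a\}$ and the $n_{t,a}$-vectors of regressors and rewards observed at time steps in $\mathcal{T}_{t,a}$ as $\mathbf{x}_{t,a}$ and $\mathbf{r}_{t,a}$ respectively. Denote the $n_{t,a} \times 2$ matrix formed by the concatenation of $\mathbf{1}_{n_{t,a}}$ and $\mathbf{x}_{t,a}$ as $X_{t,a}$, where $\mathbf{1}_{n_{t,a}}$ is the $n_{t,a}$-vector with every component equal to 1. Let $\hat{\mathbf{b}}_{t,a}$ be given by the least squares equation

$$\hat{\mathbf{b}}_{t,a} := (X_{t,a}^T X_{t,a})^{-1} X_{t,a}^T \mathbf{r}_{t,a}.$$

Let us also denote $\mathbf{x}_t = (1, x_t)^T$. Posterior expectations (using flat priors, as indicated by $\mathcal{I}_0$) can be evaluated easily, so we define exploitative value as

$$\hat{f}_{t,a}(x_t) = \mathbf{x}_t^T \hat{\mathbf{b}}_{t,a}.$$

Let $\hat{\sigma}_{t,a}$ be given by

$$\hat{\sigma}_{t,a} := \sqrt{\frac{1}{n_{t,a} - 2} (\mathbf{r}_{t,a} - X_{t,a} \hat{\mathbf{b}}_{t,a})^T (\mathbf{r}_{t,a} - X_{t,a} \hat{\mathbf{b}}_{t,a})}$$

and let $U_{t,a} \sim t_{n_{t,a} - 2}$. We define the LTS exploratory value as

$$\tilde{f}^{\mathrm{Th}}_{t,a}(x_t) := \hat{\sigma}_{t,a} \sqrt{\mathbf{x}_t^T (X_{t,a}^T X_{t,a})^{-1} \mathbf{x}_t}\; U_{t,a}. \quad (32)$$

The LTS total value is given by
$$Q^{\mathrm{Th}}_{t,a}(x_t) = \hat{f}_{t,a}(x_t) + \tilde{f}^{\mathrm{Th}}_{t,a}(x_t).$$
The OBS total value is given by
$$Q_{t,a}(x_t) = \max(Q^{\mathrm{Th}}_{t,a}, \hat{f}_{t,a}(x_t)).$$
Note that if $n_{t,a} \in \{0, 1, 2\}$ then the posterior distribution of $f_a(x_t)$ is improper. In these situations, we sample values from N(0, 10°) to obtain $Q^{\mathrm{Th}}_{t,a}$.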
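The per-action computation for the linear regression case can be sketched with NumPy as follows. The function name `linear_values` and the `fallback_sd` parameter are hypothetical: the source prescribes a wide normal draw when fewer than three observations are available, and the scale used below is a placeholder rather than the authors' value. Returning the raw draw as the OBS value in that case is also an assumption of this sketch.

```python
import numpy as np

def linear_values(x_hist, r_hist, x_now, rng, fallback_sd=1.0):
    """Return (f_hat, q_th, q_obs) for one action at regressor x_now.

    x_hist, r_hist: regressors and rewards seen on past plays of this action.
    fallback_sd: placeholder scale for the wide normal draw (an assumption).
    """
    x_hist = np.asarray(x_hist, dtype=float)
    r_hist = np.asarray(r_hist, dtype=float)
    n = x_hist.size
    if n <= 2:
        # Improper posterior: fall back to a wide normal draw.
        q_th = float(rng.normal(0.0, fallback_sd))
        return float("nan"), q_th, q_th
    X = np.column_stack([np.ones(n), x_hist])         # n x 2 design matrix
    xt = np.array([1.0, x_now])
    gram_inv = np.linalg.inv(X.T @ X)
    b_hat = gram_inv @ X.T @ r_hist                   # least squares coefficients
    f_hat = float(xt @ b_hat)                         # exploitative value
    resid = r_hist - X @ b_hat
    sigma_hat = np.sqrt(resid @ resid / (n - 2))
    u = rng.standard_t(n - 2)                         # Student t with n - 2 degrees of freedom
    f_tilde = sigma_hat * np.sqrt(xt @ gram_inv @ xt) * u   # exploratory value, as in (32)
    q_th = f_hat + float(f_tilde)                     # LTS total value
    return f_hat, q_th, max(q_th, f_hat)              # OBS total value last

rng = np.random.default_rng(0)
x_hist = [-0.4, -0.1, 0.2, 0.5]
r_hist = [0.55, 0.95, 1.15, 1.55]
f_hat, q_th, q_obs = linear_values(x_hist, r_hist, 0.3, rng)
```

In a full algorithm this would be evaluated once per action at each timestep, with the action maximising `q_th` (LTS) or `q_obs` (OBS) being played.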
4.2.3 CONVERGENCE 


In this section, we check explicitly that Assumptions 1—5 are satisfied in this linear regression set- 
ting, therefore proving that the LTS and OBS algorithms generate policies satisfying convergence 
criterion (2). 

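Lemma 9 The LTS total value estimate, $Q^{\mathrm{Th}}_{t,a}$, satisfies Assumption 1, for all $a \in \mathcal{A}$.


Proof Let $a \in \mathcal{A}$, $T > 0$, $\mathcal{I}_T$ and $M \in \mathbb{R}$ be arbitrary. For any $t > T$ and $\mathcal{I}_t = \mathcal{I}_T \cup \{r_T, \ldots, r_{t-1}, a_T, \ldots, a_{t-1}\}$ with $a_s \ne a$ for $s \in \{T, \ldots, t-1\}$, the posterior distribution of $\mathbf{b}_a$ and $\sigma_a^2$ given $\mathcal{I}_t$ will be the same as that given $\mathcal{I}_T$ (since no further information about $f_a$ is contained in $\mathcal{I}_t$). In particular for each regressor $x$, $\hat{f}_{t,a}(x) = \hat{f}_{T,a}(x)$, and $\tilde{f}^{\mathrm{Th}}_{t,a}(x)$ has the same distribution given $\mathcal{I}_t$ as it did given $\mathcal{I}_T$. Define

$$\varepsilon := \frac{1}{2} \min_{x \in \mathcal{X}} \min\left\{\mathbb{P}\left(\tilde{f}^{\mathrm{Th}}_{T,a}(x) < M - \hat{f}_{T,a}(x) \mid \mathcal{I}_T\right),\ \mathbb{P}\left(\tilde{f}^{\mathrm{Th}}_{T,a}(x) > M - \hat{f}_{T,a}(x) \mid \mathcal{I}_T\right)\right\}.$$

Since $Q^{\mathrm{Th}}_{t,a} = \hat{f}_{t,a}(x_t) + \tilde{f}^{\mathrm{Th}}_{t,a}(x_t)$, we then have that
$$\mathbb{P}(Q^{\mathrm{Th}}_{t,a} > M \mid \mathcal{I}_t) > \varepsilon$$

and

$$\mathbb{P}(Q^{\mathrm{Th}}_{t,a} < M \mid \mathcal{I}_t) > \varepsilon.$$
□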


Lemma 10 (taken from Eicker, 1963) is used to prove the consistency of the least squares estimators of the regression coefficients.


Lemma 10 The least squares estimators $\hat{\mathbf{b}}_{t,a}$, $t = 2, 3, \ldots$, converge in probability to $\mathbf{b}_a$ as $n_{t,a} \to \infty$ if and only if $\lambda_{\min}(X_{t,a}^T X_{t,a}) \to \infty$ as $n_{t,a} \to \infty$, where $\lambda_{\min}(X_{t,a}^T X_{t,a})$ is the smallest eigenvalue of $X_{t,a}^T X_{t,a}$.


Lemma 11 The exploitative value estimate $\hat{f}_{t,a}(x_t) \xrightarrow{P} f_a(x_t)$ as $n_{t,a} \to \infty$.


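Proof Let $\tilde{x}_{i,a}$ denote the value of the regressor presented on the timestep where action $a$ was picked for the $i$th time. Then

$$X_{t,a}^T X_{t,a} = \begin{pmatrix} n_{t,a} & \sum_{i=1}^{n_{t,a}} \tilde{x}_{i,a} \\ \sum_{i=1}^{n_{t,a}} \tilde{x}_{i,a} & \sum_{i=1}^{n_{t,a}} \tilde{x}_{i,a}^2 \end{pmatrix}.$$

The smallest eigenvalue is given by

$$\lambda_{\min} = \frac{n_{t,a}}{2} \left[\frac{1}{n_{t,a}} \sum_{i=1}^{n_{t,a}} \tilde{x}_{i,a}^2 + 1 - \sqrt{\left(\frac{1}{n_{t,a}} \sum_{i=1}^{n_{t,a}} \tilde{x}_{i,a}^2 - 1\right)^2 + 4 \left(\frac{1}{n_{t,a}} \sum_{i=1}^{n_{t,a}} \tilde{x}_{i,a}\right)^2}\,\right].$$

Therefore, since $\mathrm{Var}\,X > 0$, we have that

$$\lim_{n_{t,a} \to \infty} \lambda_{\min} = \lim_{n_{t,a} \to \infty} \frac{n_{t,a}}{2} \left[\mathbb{E}[x_t^2] + 1 - \sqrt{\left(\mathbb{E}[x_t^2] - 1\right)^2 + 4\, \mathbb{E}[x_t]^2}\,\right] = \infty.$$

Using Lemma 10, we then have
$$\left[\hat{\mathbf{b}}_{t,a} - \mathbf{b}_a\right] \xrightarrow{P} 0 \text{ as } n_{t,a} \to \infty.$$
Multiplying on the left by $\mathbf{x}_t^T$ then gives

$$\hat{f}_{t,a}(x_t) - f_a(x_t) \xrightarrow{P} 0 \text{ as } n_{t,a} \to \infty.$$
□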
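The closed form for the smallest eigenvalue of the $2 \times 2$ Gram matrix can be compared against a numerical eigensolver. The sketch below is a sanity check under the regressor distribution $U(-0.5, 0.5)$ used in this section; the function name `lambda_min_closed_form`, the sample size and the seed are illustrative choices.

```python
import numpy as np

def lambda_min_closed_form(x):
    """Smallest eigenvalue of [[n, sum x], [sum x, sum x^2]] via the quadratic formula."""
    x = np.asarray(x, dtype=float)
    n = x.size
    m1 = x.mean()            # (1/n) * sum of x_i
    m2 = np.mean(x ** 2)     # (1/n) * sum of x_i^2
    return 0.5 * n * (m2 + 1.0 - np.sqrt((m2 - 1.0) ** 2 + 4.0 * m1 ** 2))

rng = np.random.default_rng(0)
x = rng.uniform(-0.5, 0.5, size=1000)
X = np.column_stack([np.ones(x.size), x])
lam_numeric = np.linalg.eigvalsh(X.T @ X)[0]   # eigvalsh returns eigenvalues in ascending order
lam_formula = lambda_min_closed_form(x)
```

The bracketed factor is strictly positive exactly when the sample variance of the regressors is positive, so the eigenvalue grows linearly in the number of observations.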


Lemma 12 The LTS total value estimate, $Q^{\mathrm{Th}}_{t,a}$, satisfies Assumptions 2–4, for all $a \in \mathcal{A}$.


Proof To prove this lemma, we need to show that $\tilde{f}^{\mathrm{Th}}_{t,a}(x_t) \xrightarrow{P} 0$ as $n_{t,a} \to \infty$ for all actions $a \in \mathcal{A}$. In order to do this, we consider each component in the product that forms $\tilde{f}^{\mathrm{Th}}_{t,a}(x_t)$ (see (32)). Firstly, we consider $U_{t,a}$. It is well known (as described in Zwillinger, 2000) that

$$U_{t,a} \xrightarrow{D} N(0,1) \text{ as } n_{t,a} \to \infty. \quad (33)$$











EA : a P 3 
Next, we consider 6; a. Using the facts that b, a — ba as n; — ©, zia = Ti — fa(x;) and Elz, 4] — 0 
we have that 











Da — 


A 1 ^ A 
Ota :— y 2 (rra EM X, abra)! (ria = X, abia) 





P 1 
? y (Tra X, aba)! (ria = X, aba) as Nt a — oo 
Ng —2 


“4 / Elz? al aS Nj a — oo 


= V Vara] + [E[z. 4] 


= 4) Varie 2] = Oa. (34) 
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Finally, let us consider x (X7. b onm X,. We start by looking at the determinant of X, 1, Xt, a. We have 
that 


1 Na 























ay M E[x2] — ET) as Ny a — oo 
Var|x;]. (35) 
Using the standard formula for inverting a 2 x 2 matrix, we get 


LR RÀ Nt a 
x! (X Xen) x X; = det X7, X, ,, X mae Yr, — 2x; ¥ Xia + ma) 
t.a 


i=1 i-l 


Na 


1 M m 
X det X7, X,. " (». ET biia (= a LA a) 1 
Nt a Nt a 
+ (Qu) ursa tama) 














Nt, a cil 
1 1 Titia Nt a 
T -2 
Nt a a det X7, X, a (= 7 (Ls. 2 Xt Li 5 +x n; a) 
1 1 Mta 1 Nt a 
Nt a + deX7 Xa (^ f Ye). p uU | 
za ("lg as 
zm x —Xx . 
nia det XX alma t 











Using (35), (36) and the facts that Var[x,] > 0 and both x, and E[x,] are bounded, we have that 


2 
TXT Xia) ee : [Eis -x| 250 = (37) 
x X I as n co, 
A 2s á Na n; a Var[x,] ha 




















Equations (33), (34) and (37) imply that f^ (x+) B Oas "a — œ. Therefore, since Q/) = fax) + 
fi> (x+), Lemma 11 gives us that 


Th — fao) — 0 as nya — oo, 


satisfying Assumptions 2 and 4. This same holds for $5 (x;), hence Assumption 3 is satisfied too. 
a 
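The decomposition of the quadratic form used in the proof of Lemma 12 can be verified numerically. This is a rough sketch, assuming design rows $(1, \tilde x_{i,a})$ and the evaluation vector $(1, x_t)$; the helper names `quad_form` and `quad_form_decomposed` are hypothetical.

```python
import numpy as np


def quad_form(x_hist, x_t):
    """Direct evaluation of v^T (X^T X)^{-1} v with v = (1, x_t)."""
    X = np.column_stack([np.ones_like(x_hist), x_hist])
    v = np.array([1.0, x_t])
    return float(v @ np.linalg.solve(X.T @ X, v))


def quad_form_decomposed(x_hist, x_t):
    """Same quantity written as 1/n + (n / det) * (sample mean - x_t)^2."""
    n = x_hist.size
    det = n * np.sum(x_hist ** 2) - np.sum(x_hist) ** 2
    return 1.0 / n + (n / det) * (x_hist.mean() - x_t) ** 2


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    for n in (10, 100, 1000, 10000):
        xs = rng.uniform(-0.5, 0.5, size=n)
        print(n, quad_form(xs, 0.3), quad_form_decomposed(xs, 0.3))
```

Both expressions agree up to rounding, and both shrink at rate $1/n_{t,a}$ under these assumptions, which is why the sampled exploratory term vanishes as an action is selected more often.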


Lemma 13 The exploitative value estimate, $\hat f_{t,a}(x_t)$, satisfies Assumption 5, for all $a \in A$.


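Proof Let $a \in A$ and $T > 0$ be arbitrary. For any $t > T$ and $J_t = I_T \cup \{x_T, \ldots, x_{t-1}, r_T, \ldots, r_{t-1}, a_T, \ldots, a_{t-1}\}$ with $a_s \neq a$ for $s \in \{T, \ldots, t-1\}$, the regression coefficients $\hat b_{t,a}$ are equal to $\hat b_{T,a}$. Hence

$$\max_{x \in [-0.5, 0.5]} \hat f_{t,a}(x) = \max_{x \in [-0.5, 0.5]} \hat f_{T,a}(x).$$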
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The Assumption then follows by noting that S$ = R. a 


Proposition 2 Within the described linear regression setting convergence criterion (2) is satisfied 
when the LTS or the OBS algorithm is used. 


Proof Assumptions 1-5 hold, so the result follows directly from Theorems 1 and 2. □


4.2.4 EXPERIMENTAL RESULTS 


The process is run for 10000 independent trials. A time window of $T = \{1, \ldots, 5000\}$ is considered on each trial. The regression coefficients for the actions are set to $(\beta_{1,a}, \beta_{2,a}) = (0, 1), (0, -1), (-0.1, 0), (0.1, 0)$ for $a = 1, 2, 3, 4$ respectively. The resulting expected reward functions are plotted in Figure 6. For each trial:


• $\forall t \in T$ sample $x_t \sim U(-0.5, 0.5)$

• $\forall a \in A$ and $\forall t \in T$ sample $z_{t,a} \sim N(0, \sigma_a^2)$ with $\sigma_a = 0.5$

• $\forall a \in A$ and $\forall t \in T$ evaluate potential reward $r_{t,a} = \beta_{1,a} + \beta_{2,a} x_t + z_{t,a}$

• record the regret incurred using various action selection methods.
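The trial procedure listed above can be sketched as follows. This is an illustrative outline only: the class names `UniformRandom` and `Oracle`, the `select`/`update` interface and the function `simulate_trial` are assumptions made for this example, and regret is taken here to be the gap in expected reward between the best action and the chosen action.

```python
import numpy as np

# Rows are (beta_1a, beta_2a) for actions a = 1, 2, 3, 4, as in the text.
BETA = np.array([[0.0, 1.0], [0.0, -1.0], [-0.1, 0.0], [0.1, 0.0]])
SIGMA = 0.5  # noise standard deviation for every action


class UniformRandom:
    """Baseline policy (hypothetical): ignores the regressor and all feedback."""

    def select(self, x, rng):
        return int(rng.integers(len(BETA)))

    def update(self, x, action, reward):
        pass


class Oracle:
    """Knows the true coefficients, so it incurs zero expected regret."""

    def select(self, x, rng):
        return int(np.argmax(BETA[:, 0] + BETA[:, 1] * x))

    def update(self, x, action, reward):
        pass


def simulate_trial(policy, horizon=5000, seed=0):
    """Run one trial and return cumulative expected regret at each timestep."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-0.5, 0.5, size=horizon)                  # regressors x_t
    z = rng.normal(0.0, SIGMA, size=(horizon, len(BETA)))     # noise z_{t,a}
    expected = BETA[:, 0] + np.outer(x, BETA[:, 1])           # f_a(x_t)
    rewards = expected + z                                    # r_{t,a}
    regret = np.zeros(horizon)
    for t in range(horizon):
        a = policy.select(x[t], rng)
        policy.update(x[t], a, rewards[t, a])                 # only r_{t,a_t} is revealed
        regret[t] = expected[t].max() - expected[t, a]
    return np.cumsum(regret)
```

A learning policy such as LTS or OBS would plug into the same interface by maintaining per-action regression estimates inside `update`; averaging the returned curves over many seeds gives a curve of the kind described for Figure 7.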


We compare the performance of LTS and OBS to an interval estimation (IE) method (or LinUCB, in the terminology of Li et al., 2010) similar to that described in Pavlidis et al. (2008). However, we use the posterior distribution of the mean to evaluate the upper confidence bound rather than using the predictive distribution. Specifically, the action selection rule used is given by

$$a_t = \operatorname*{argmax}_{a \in A} \left[ \hat f_{t,a}(x_t) + \hat\sigma_{t,a} \sqrt{\mathbf{x}_t^T (X_{t,a}^T X_{t,a})^{-1} \mathbf{x}_t}\; t_{1 - \lambda/100,\, n_{t,a} - 2} \right],$$

where $t_{\gamma,n}$ denotes the quantile function of Student's $t$ distribution with $n$ degrees of freedom evaluated at $\gamma$. This ensures that the value estimates are consistent, that is, the value estimates converge to the true expected reward as associated actions are selected infinitely often. We implement the IE method with parameter values $\lambda = 0.01$, $\lambda = 5$ and $\lambda = 25$.

The results of the simulation can be seen in Figures 7 and 8. Figure 7 (left) shows cumulative regret averaged over the trials. The OBS algorithm displays the best performance (with respect to cumulative regret averaged over trials) in the problem considered, and this performance is significantly better than that of the LTS algorithm. It is also clear that the IE method performance is highly sensitive to parameter choice. The best parameter choice in this case is $\lambda = 5$; however, it is not clear how this parameter should be chosen based on the prior information provided. In general, if $\lambda$ is 'too high', then too much emphasis is put on short term performance and if $\lambda$ is 'too low' then too much emphasis is put on long term performance. This is indicated by the curves for the $\lambda = 25$ and $\lambda = 0.01$ methods respectively. Figure 7 (right) shows boxplots indicating the distribution of final cumulative regret over trials. The boxplots indicate that the IE methods become riskier as the significance parameter used is increased and that the significance parameter provides a way of trading off median efficiency and risk. The only method to compete with OBS on cumulative regret averaged over trials is the $\lambda = 5$ IE method; however, the OBS final regret distribution is more concentrated than that of the $\lambda = 5$ IE method. In Figure 8, we present plots of the reward ratio (2) through time, for the first 100 experiments, in order to illustrate the results proved in the theoretical part of the paper. Although convergence of the ratio has not occurred after the 5000 iterations, it is clear that the ratio is improving over time.

Figure 6: The expected reward functions for the 4 actions in linear regression simulation.


4.3 Web-Based Personalised News Article Recommendation 


We now consider the problem of selecting news articles to recommend to internet users based on 
information about the users. In our framework, the recommendation choice corresponds to an action 
selection and the user information corresponds to a regressor. The objective is to recommend an 
article that has the highest probability of being clicked. 


We test the performance of the LTS and OBS algorithms on a real-world data set, the Yahoo! 
Front Page Today Module User Click Log Data Set (Yahoo! Academic Relations, 2011). A similar 
study is performed by Chapelle and Li (2011). However we consider multiple trials over a short 
time horizon, as opposed to Chapelle and Li’s single trial over the full data set, to investigate the 
short term performance of the algorithms, and in particular to address the claim made in Section 
1.2 regarding a potential short term benefit of using OBS over using LTS. It is necessary to average 
results over multiple trials given the randomised nature of the OBS and LTS algorithms. We also test 
the LinUCB algorithm of Li et al. (2010) with various parameter settings to provide a benchmark 
for comparison.

Figure 7: Performance of various algorithms in linear regression simulations. Left: Cumulative regret averaged over trials. Right: Distribution of cumulative regret at t = 5000. Results based on 10000 independent trials.

Figure 8: Convergence of the ratio (2) in the first 100 linear regression simulations.


4.3.1 USE OF DATA SET


The data set describes approximately 36M instances of news articles being recommended to inter- 
net users on the Yahoo! Front Page Today Module at random times in May 2009. The form and 
collection of the data set are both described in detail by Li et al. (2010). For each recommendation, 
the data contains information concerning which article was recommended, whether the recommen- 
dation was clicked and a feature vector describing the user. The recommended articles are chosen 
uniformly at random from a dynamic pool of about 20 choices, with articles being added and removed at various points of the process. The user features, $x_t$, are given as vectors of length 6 with
one component fixed to 1, and are constructed as described by Li et al. (2010). The reward is defined
to be 1 if the recommendation is clicked and 0 otherwise. 

The use of past data presents a problem in evaluating a decision-making algorithm. Specifically, 
within the data a random article is recommended on each instance, which might well be different to 
the article that the decision-making algorithm selects during testing. This problem can be avoided 
by implementing the unbiased offline evaluator procedure of Li et al. (2011). Under this procedure, 
if the action selected by the algorithm does not match the action selected in the data point, the 
current data point, and subsequent data points, are ignored until a data point which matches user 
data and action selection occurs. The observed reward from this data point is then awarded to the 
algorithm, and the user data from the next recommendation instance in the data is used in the next 
evaluation step. 
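The offline evaluation idea can be illustrated with a short sketch. It assumes one common reading of the replay procedure, in which the policy is queried on each logged event's own user features and the event is discarded without any update when the chosen article differs from the logged one. The names `replay_evaluate` and `FixedArticlePolicy`, and the `select`/`update` interface, are hypothetical.

```python
def replay_evaluate(policy, logged_events, n_accept):
    """Replay logged (features, logged_action, reward) events against a policy.

    Returns (total_clicks, accepted_events). Events whose logged action does
    not match the policy's choice are skipped and never shown to the policy.
    """
    clicks = 0
    accepted = 0
    for features, logged_action, reward in logged_events:
        if accepted >= n_accept:
            break
        chosen = policy.select(features)
        if chosen != logged_action:
            continue  # mismatch: ignore this data point entirely
        policy.update(features, chosen, reward)  # matched: reward is awarded
        clicks += reward
        accepted += 1
    return clicks, accepted


class FixedArticlePolicy:
    """Toy policy (assumption for illustration): always recommends one article."""

    def __init__(self, article):
        self.article = article
        self.updates = 0

    def select(self, features):
        return self.article

    def update(self, features, action, reward):
        self.updates += 1
```

Because the logged articles were chosen uniformly at random, roughly one in every K logged events is accepted when K articles are available, so many more logged events than accepted interactions are consumed.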


4.3.2 ALGORITHM IMPLEMENTATION 


The LTS and OBS algorithms are implemented using the logistic regression model of Chapelle and Li (2011). It is assumed that there is an unknown weight vector, $w_a$, for each article $a \in A$ such that

$$P(r_t = 1 \mid a_t = a, x_t = x) = \left(1 + \exp(-w_a^T x)\right)^{-1}.$$

Approximate posterior distributions for each $w_a$ are estimated to be Gaussian with mean and variance updates as described in Algorithm 3 of Chapelle and Li (2011). For our numerical experiment, we set the unspecified regularisation parameter of Chapelle and Li (2011) to 100. The LTS algorithm can easily be implemented by sampling weight vectors from the posteriors and selecting the article with the weight vector forming the highest scalar product with the current user feature vector. The OBS algorithm can easily be implemented by also considering posterior means of these scalar products. We also test the LinUCB algorithm, as implemented by Chapelle and Li (2011), with parameter $\alpha$ set to each of 0.5, 1 and 2.
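The difference between the two selection rules can be sketched in a few lines. The sketch assumes independent (diagonal) Gaussian posteriors over each weight vector, stored as arrays of means and variances, and assumes that OBS scores each article by the larger of its sampled scalar product and its posterior mean scalar product. The function names are illustrative, and the posterior updates themselves are not shown.

```python
import numpy as np


def lts_scores(means, variances, x, rng):
    """Sample one weight vector per article and score it against features x.

    means, variances: arrays of shape (n_articles, n_features).
    """
    sampled_weights = rng.normal(means, np.sqrt(variances))
    return sampled_weights @ x


def obs_scores(means, variances, x, rng):
    """Optimistic variant: never score an article below its posterior mean."""
    sampled = lts_scores(means, variances, x, rng)
    return np.maximum(sampled, means @ x)


def select_article(scores):
    """Pick the article with the highest score (first index on ties)."""
    return int(np.argmax(scores))
```

Since the logistic link is increasing, ranking articles by scalar product is equivalent to ranking them by the implied click probability, so no sigmoid is needed at selection time.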


4.3.3 NUMERICAL EXPERIMENTS 


As previously mentioned, our focus is short term performance averaged over numerous trials. We 
focus on the case of only 4 articles, and therefore remove all instances outwith these 4 articles from 
the data set. On each of 2,500 trials, we run each of the 5 algorithms until 5,000 interactions are 
accepted using data from the start of the supplied data set (Yahoo! Academic Relations, 2011); we 
use only data from the start of the data set to avoid confounding the algorithm evaluations with the 
non-stationarity of the data. 

The concept of regret is difficult to use as a performance measure in this setting, since there is no true model given for comparison. We instead consider the percentage of past timesteps resulting in clicks, otherwise known as the click-through rate (CTR), and percentage benefit of OBS over LTS with respect to CTR. Again, to avoid issues of non-stationarity, we normalise all CTRs by dividing by the CTR achieved (on these four articles) in the original data set.

Figure 9: Normalised Click Through Rate through time for various algorithms. Results averaged over 2,500 independent trials.
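The performance measures described here reduce to simple arithmetic. The following sketch assumes a 0/1 reward sequence per trial and a known baseline CTR for the logged data; the function names and the example numbers are hypothetical.

```python
import numpy as np


def running_ctr(rewards):
    """Fraction of past timesteps that resulted in a click, at every timestep."""
    rewards = np.asarray(rewards, dtype=float)
    return np.cumsum(rewards) / np.arange(1, rewards.size + 1)


def normalise(ctr, baseline_ctr):
    """Divide by the CTR achieved in the original (randomly served) data."""
    return ctr / baseline_ctr


def percentage_benefit(ctr_obs, ctr_lts):
    """Improvement of OBS over LTS, expressed as a percentage of LTS CTR."""
    return 100.0 * (ctr_obs - ctr_lts) / ctr_lts
```

For instance, with made-up values, a normalised CTR of 1.10 for one algorithm against 1.00 for another corresponds to a 10 percent benefit; the normalisation constant cancels in the percentage, so it affects only plots of normalised CTR.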


The results of the experiment can be found in Figures 9 and 10. Figure 9 shows the normalised 
CTR for all 5 algorithms, averaged over all 2500 runs. It is clear that the performance of the 
LinUCB algorithm is sensitive to parameter choice; the version with parameter set to 1 performs 
much better than the version set to 0.5, and it is not clear in advance of implementing the algorithm 
which parameter will be optimal. As a caveat on these results, it is worth noting that the portion of 
the data set used for each trial is the same, and also that the LinUCB algorithms are deterministic 
given past information (except in the case of a tie in action values), so it is hard to extrapolate 
general results relating to the performance of LinUCB algorithms. Furthermore Chapelle and Li 
(2011) explain that the performance of the LinUCB algorithm degrades significantly with increasing 
feedback delay, while the LTS and OBS algorithms are more robust to the delay, so the strong 
performance of the highest-performing LinUCB algorithm in this experiment should not be taken 
as conclusive evidence of high real-world performance. Unfortunately it is not possible to produce 
plots comparable to Figures 5 and 8 in this case since the true optimal actions are not known. 
Figure 10 shows the difference in performance of OBS and LTS, expressed as a percentage of LTS performance, averaged over all 2,500 runs. It is clear that the OBS algorithm outperforms the LTS algorithm across the time period considered, validating the intuition in Section 1.2. The short term improvement is small, but in many web-based applications, a small difference in performance can be significant (Graepel et al., 2010).

Figure 10: OBS CTR as a percentage improvement of LTS CTR through time. Results averaged over 2,500 independent trials.


5. Discussion 


The assumptions made for the theoretical results in Section 3 are mild in the sense that one would 
expect them to hold if the true posterior distributions and expectations are used. It is worth noting 
that convergence criterion (2) is satisfied even when approximations to the posterior distributions 
and expectations for the $f_a(x_t)$ are used with the LTS and OBS algorithms, so long as the relevant
assumptions are satisfied. Hence, convergence is guaranteed for a large class of algorithms. 

We have seen that both the LTS and the OBS algorithms are easy to implement in the cases 
considered. They are also computationally cheap and robust to the use of posterior approximations, 
when compared to belief-lookahead methods, such as Gittins indices. The simulation results for 
the OBS algorithm are very encouraging. In every case, the OBS algorithm outperformed the LTS 
algorithm and performed well compared to recent benchmarks.